ABSTRACT

An architectural configuration for a Fault Tolerant Parallel
Processor (FTPP) is defined to meet a space system reliability and
throughput requirement. The FTPP utilizes a flexible architecture that
consists of a set of interconnected clusters, each of which consists of a
set of interconnected processors. FTPP architectural and redundancy
strategies are perturbed to define a more reliable system, and
combinatorial reliability models are developed to analyze these
perturbations. The perturbations examined include changes to the cluster
architecture, the use of redundant clusters, the redistribution of tasks
among clusters, and three cluster interconnection schemes: fully linked,
centrally linked, and singly linked. The results of these analyses are
used to define relationships among FTPP reliability, throughput, and
architecture, which should be useful to a system designer attempting to
meet a specific application requirement.

CHAPTER 1


INTRODUCTION 

1.1  MOTIVATION 

Real-time control of increasingly complex space systems requires the
development of faster throughput computers. Examples of the throughputs
of past and present computers are:

1. Apollo Guidance and Navigation Computer - 10 KIPS

2. Space Shuttle Data Management System - 400 KIPS [1]

(1 KIPS = one thousand instructions per second)

Examples of throughput requirements of projected systems are:

1. NASA Space Station Requirement - 15 MIPS

2. Advanced Military Spacecraft/Autonomous Interplanetary Spacecraft
Requirements - 100+ MIPS [1]

(1 MIPS = one million instructions per second)

As spacecraft missions become more ambitious, breakthroughs in
processor throughput technology will be necessary to meet the increasing
demand for faster throughput computers. The throughput speeds of current
state-of-the-art processors range from 3 to 5 MIPS, and it is doubtful
that the high speed throughput requirements projected for future systems
can be achieved using a single processor. In the event of a technological
leap in processor throughput capability, designers will inevitably find
the need for still faster throughputs. In addition to high speed
throughput, processors for future spacecraft must possess a reliability
commensurate with the real-time mission critical applications they control
[2].

A solution to the projected gap between processing requirements and
capabilities is parallel processing. Parallel processing is an efficient
method of information processing that emphasizes the exploitation of
concurrent events in the computing process by demanding the execution of
many programs simultaneously [3].

Ideally, software partitioned into n pieces will run n times as fast
in parallel; but, because of inefficient algorithms for exploiting
concurrency in the computer problem and of the idling of the processor by
conflicts over memory access or communication paths, in actual practice
the speedup is much less [3]. According to Hwang and Briggs, estimates of
the actual speedup range from a lower bound of $\log_2 n$ (Minsky's
Conjecture) to an upper bound of $n/\ln n$, where n is the number of
processors in the parallel system.
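
The two speedup estimates quoted above can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name `speedup_bounds` and the sample processor counts are assumptions introduced here, not part of the thesis.

```python
import math


def speedup_bounds(n: int) -> tuple[float, float]:
    """Return (lower, upper) speedup estimates for n processors.

    lower = log2(n)   (Minsky's conjecture)
    upper = n / ln(n)
    """
    if n < 2:
        raise ValueError("the bounds are only meaningful for n >= 2")
    return math.log2(n), n / math.log(n)


# Hypothetical processor counts, chosen only for illustration.
for n in (4, 16, 64):
    low, high = speedup_bounds(n)
    print(f"n={n:3d}  log2(n)={low:5.2f}  n/ln(n)={high:6.2f}  ideal={n}")
```

Both estimates fall well short of the ideal speedup of n as n grows.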

In  order  to  meet  the  stringent  reliability  requirements  associated 
with  real-time  mission  critical  applications,  suitable  architectures  and 
redundancy  strategies  must  be  developed  to  exploit  the  redundancy  inherent 
in  parallel  systems.  Examining  these  architectures  and  redundancy 
strategies  constitutes  the  focus  of  this  thesis. 


1.2  PURPOSE 


This thesis seeks to contribute to the current effort at the Charles
Stark Draper Laboratory to develop a high speed fault tolerant computer.
The effort has been designated as the Fault Tolerant Parallel Processor
(FTPP). The purpose of this thesis is to define relationships among
reliability, throughput, and computer architecture for the FTPP. This
will be done in the context of applying requirements for a potential FTPP
space system application. The FTPP must be capable of meeting application
reliability and throughput requirements while minimizing FTPP hardware
overhead and required FTPP component reliability.

It is recognized that determining absolute reliability figures
through analysis lacks credibility without detailed knowledge of component
behavior and past experience with similar designs, neither of which is
available in the early design phase. Analysis can, however, generate
credible relative reliability figures that are required to define the
desired relationships and to make decisions regarding suitable FTPP
architectures and redundancy strategies. Applying an actual reliability
requirement also provides an appraisal of the suitability of the FTPP
concept for one of its potential applications.


1.3  METHODOLOGY 


The remainder of this thesis will define a baseline FTPP architecture
and examine the effects of various modifications in order to optimally
meet the application requirement. More specifically, the thesis will
proceed as follows:

1. Define the general FTPP architecture and configuration
constraints; and define FTPP component parameters to include Mean Time To
Failure (MTTF), reconfiguration time, and throughput capability.

2. Define the space system application requirements. The
application requirements will be reduced to required probability of
success over a time period and required throughput.

3. Define a baseline FTPP architecture. The baseline FTPP
architecture will be defined using the FTPP general architecture as a
guide and be designed to meet the application throughput requirement.

4. Generate a FTPP reliability model and calculate the reliability
of the baseline architecture.

5. Analyze the effects of intra-cluster modifications to the
baseline architecture.

6. Analyze the effects of inter-cluster modifications to the
baseline architecture.

7. Draw conclusions regarding the relationships among reliability,
throughput, and computer architecture for the FTPP.


These steps will be addressed in the following chapters:


Chapter  Two:  Problem  Description,  addresses  steps  1  through  7 


CHAPTER  2 


PROBLEM  DESCRIPTION 

2.1  FAULT  TOLERANT  PARALLEL  PROCESSOR  DESCRIPTION  [1,2,4] 

The Draper Laboratory Fault Tolerant Parallel Processor (FTPP) is
being designed to achieve high throughput and high reliability to meet the
projected stringent requirements for future applications. To date, Draper
Laboratory has defined a general architecture for the FTPP and begun
component breadboarding.

The FTPP utilizes a flexible design concept where the architecture
may be modified to suit the application. The FTPP consists of a set of
building blocks arranged in clusters (figure 2.1-1).

2.1.1  Cluster  Architecture 

Each cluster within a FTPP consists of processor elements, network
elements, input/output elements and memory elements.

PROCESSOR ELEMENTS (PE) - Processor elements perform the tasks of
global controller, cluster controller and working processor element.

Global Controller (GC) - Manages inter-cluster communications;
the loss of the GC implies system loss.

Cluster Controller (CC) - Manages intra-cluster communications;
the loss of a CC implies cluster loss.

Working Processor Element (WPE) - Performs the computational
application tasks; the loss of a WPE implies task loss.

Each processor element is assumed to have an exponential failure
rate, a reconfiguration time averaging .25 seconds, and a throughput
capability of 5 MIPS.

[Figure legend: memory element, processor, network element, I/O element]

NETWORK ELEMENTS (NE) - Network elements serve to pass information
from processing elements to input/output elements, and are an
integral part of the FTPP architecture. The loss of a network
element results in the loss of all the processors connected with the
network element and one input/output element. When there are three
processor elements per network element, each network element is
assumed to have an exponential failure rate with a MTTF 10 times
greater than the MTTF of a processing element. As more processor
elements are connected to a network element, its increased
complexity results in a decreased reliability. In modelling the
greater complexity of a network element where more than three
processor elements are used, the failure rate for the network
element is assumed to increase ten percent for each additional
processor element added. Preliminary reliability modelling has
suggested five network elements per cluster [5].


INPUT/OUTPUT ELEMENTS (IOE) - Input/output elements serve to pass
information between clusters via communication lines. Input/output
elements are critical in determining the topology of a group of
clusters, and the loss of an input/output element implies the loss
of a communication link. Each input/output element is assumed to
have an exponential failure rate with a MTTF equal to the MTTF of a
processing element. Each input/output element's failure rate is
assumed to increase an additional 10 percent for each communication
line stemming from it.

MEMORY ELEMENTS (ME) - Memory elements are classed as either global
memory or regional memory.

Global Memory (GM) - Stores information germane to every
processor element's perception of the system state, but which
often changes as a result of modifications or of updates by the
processor elements. Global memory is assigned to each cluster,
and the loss of global memory implies cluster loss.

Regional Memory (RM) - Stores time-invariant information.
Regional memory is assigned to n processor elements, and the
loss of regional memory implies the loss of the processor
elements assigned to the memory.

Each memory element is assumed to have an exponential failure rate
with a MTTF equal to the MTTF of a processing element. Like network
elements, the failure rate increases by 10 percent for each
additional processor element over three.
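
The component failure-rate assumptions stated above can be collected into a small sketch. The function names are hypothetical, and the code assumes that each "10 percent" increase is applied linearly to the base failure rate (not compounded); the text does not say which is intended.

```python
def network_element_rate(pe_mttf_hours: float, pes_per_ne: int = 3) -> float:
    """Failure rate (per hour) of a network element.

    Base MTTF is 10 times the processor MTTF when 3 processors are
    attached; each processor beyond three adds 10 percent to the rate
    (assumed linear, not compounded).
    """
    base_rate = 1.0 / (10.0 * pe_mttf_hours)
    extra = max(0, pes_per_ne - 3)
    return base_rate * (1.0 + 0.10 * extra)


def io_element_rate(pe_mttf_hours: float, lines: int) -> float:
    """Failure rate of an input/output element with `lines` communication
    lines stemming from it (10 percent added per line, assumed linear)."""
    return (1.0 / pe_mttf_hours) * (1.0 + 0.10 * lines)


def memory_element_rate(pe_mttf_hours: float, pes_per_me: int = 3) -> float:
    """Failure rate of a memory element; same scaling rule as the network
    element, but with a base MTTF equal to the processor MTTF."""
    extra = max(0, pes_per_me - 3)
    return (1.0 / pe_mttf_hours) * (1.0 + 0.10 * extra)


# Example with a hypothetical processor MTTF of 10,000 hours.
print(network_element_rate(10_000.0, pes_per_ne=5))
print(io_element_rate(10_000.0, lines=5))
print(memory_element_rate(10_000.0, pes_per_me=4))
```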

2.1.2 Inter-Cluster Connectivity

A FTPP system consists of a set of interconnected clusters.

TOPOLOGY - Topology refers to the particular cluster interconnection
scheme employed. Topology affects reliability, throughput,
modularity, and maintainability. Cluster to cluster connections are
depicted in figure 2.1-2. Each cluster to cluster 'link' consists
of a set of five communications 'lines' between input/output
elements. A more detailed discussion of topology is found in
section 3.3.

2.1.3 System Failure

System failure occurs when the FTPP can no longer reliably meet the
throughput requirement. More specifically, if no degradation of system
throughput is permitted and there are no redundant clusters, system
failure occurs when either of the following two conditions holds true:

1. Any cluster is declared failed. A cluster not declared
operational is declared failed. For a cluster to be declared
operational, at least one processor with access to working memory
and network elements must be operational for each task assigned to
the cluster; and at least two communications lines between the
cluster and any other cluster must be operational. A communications
line is declared operational if both input/output elements and their
respective network elements are all operational.

Figure 2.1-2: Cluster-to-Cluster Connectivity. [Labels: communication
line; communication link (5 lines).]

2. Any cluster or group of clusters is isolated from the
system. This requirement implies the system can be declared failed
even when there are no cluster failures and no individual cluster
isolations. Figure 2.1-3 depicts a case where the communication
links between Clusters 1 and 2 and between Clusters 4 and 5 have
failed. Although there are no individual isolations, Clusters 1, 5, 6
are unable to communicate with Clusters 2, 3, 4 and the system is
declared failed.


Figure 2.1-3: System Failure Due to Cluster Group Isolation.
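
The group-isolation condition amounts to a connectivity check on the graph of clusters and working links. The sketch below is illustrative only: it assumes the six clusters of the example are joined in a ring (1-2-3-4-5-6-1), which is consistent with the failure case described but is not stated explicitly in the surviving text, and the function name `is_connected` is hypothetical.

```python
from collections import deque


def is_connected(clusters, links):
    """Return True if every cluster can reach every other cluster
    over the given working links (breadth-first search)."""
    neighbours = {c: set() for c in clusters}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(clusters))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(neighbours)


clusters = [1, 2, 3, 4, 5, 6]
# Assumed ring topology for the example.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
failed = {(1, 2), (4, 5)}
working = [link for link in ring if link not in failed]

print(is_connected(clusters, ring))     # all links working
print(is_connected(clusters, working))  # links 1-2 and 4-5 failed
```

With both links failed, the search started from Cluster 1 reaches only Clusters 5 and 6, so the system is declared failed even though no single cluster is isolated.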


2.2 APPLICATION SPECIFICATION


The specification to be examined by this thesis is stated as follows:

1. Out of 1000 spacecraft utilizing the FTPP, no more than one
should be lost due to computer failure over a 20 year period.

2. The system will be maintained and fully repaired once per year.

3. The system requires a computational throughput of 100 MIPS.

The preceding specification defines the reliability requirements for
various computer replication levels.

For a simplex computer system, the expected number of losses in 1
year will equal 1/20 or .05 spacecraft/year. Using the equation for the
expected value of a binomially distributed variable yields:

$$E(L) = N \, Q_S \tag{2.2-1}$$

where:

$E(L)$ = Expected number of spacecraft losses due to computer failure
in 1 year.

$N$ = Initial number of operational spacecraft.

$Q_S$ = Unreliability of the spacecraft computer system over 1 year.

Solving for $Q_S$ with $N = 1000$ and $E(L) = .05$ yields an unreliability
requirement of $5.0 \times 10^{-5}$ over 1 year. Since the system consists
of only a simplex computer, this requirement is the reliability
requirement for that single computer.

For a duplex computer system: using combinatorial analysis and
assuming a coverage of .85 for the first failure yields the equation:

$$Q_S = (1 - C)\,P(\text{1 computer fails}) + P(\text{2 computers fail}) \tag{2.2-2}$$

where:

$C$ = Coverage for a duplex computer system.

$P(\text{1 computer fails}) = 2(1 - Q)Q$, where $Q$ is the unreliability of a
single computer.

$P(\text{2 computers fail}) = Q^2$.

Solving for $Q$ numerically yields an unreliability requirement of
$1.661 \times 10^{-4}$ over 1 year.

For a triplex computer system: using combinatorial analysis and
assuming a coverage of 1.00 for the first failure and .85 for the second
failure yields the equation:

$$Q_S = (1 - C)\,P(\text{2 computers fail}) + P(\text{3 computers fail}) \tag{2.2-3}$$

where:

$P(\text{2 computers fail}) = 3(1 - Q)Q^2$.

$P(\text{3 computers fail}) = Q^3$.

Solving for $Q$ numerically yields an unreliability requirement of
$1.040 \times 10^{-2}$ over 1 year.

For a quadruplex computer system: using combinatorial analysis and
assuming a coverage of 1.00 for the first two failures and .85 for the
third failure yields the equation:

$$Q_S = (1 - C)\,P(\text{3 computers fail}) + P(\text{4 computers fail}) \tag{2.2-4}$$

where:

$P(\text{3 computers fail}) = 4(1 - Q)Q^3$.

$P(\text{4 computers fail}) = Q^4$.

Solving for $Q$ numerically yields an unreliability requirement of
$4.326 \times 10^{-2}$ over 1 year.

In summary, the specification yields a computer unreliability
requirement which varies with the computer replication level:

Computer Unreliability Requirement (Simplex Configuration) = $5.00 \times 10^{-5}$

Computer Unreliability Requirement (Duplex Configuration) = $1.66 \times 10^{-4}$

Computer Unreliability Requirement (Triplex Configuration) = $1.04 \times 10^{-2}$

Computer Unreliability Requirement (Quadruplex Configuration) = $4.32 \times 10^{-2}$
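
The numerical solutions above can be cross-checked with a short bisection sketch. It assumes a system unreliability target of 5.0 x 10^-5 per year, a coverage of .85 on the last survivable failure, and full coverage of all earlier failures, as in equations 2.2-2 through 2.2-4. The function names are hypothetical, and the solver used in the thesis is not described.

```python
def system_unreliability(q: float, level: int, coverage: float = 0.85) -> float:
    """Unreliability of a `level`-plex computer system, given the
    unreliability q of a single computer (form of eqs. 2.2-2 to 2.2-4)."""
    if level == 1:
        return q
    # P(exactly level-1 computers fail) and P(all computers fail).
    p_all_but_one = level * (1.0 - q) * q ** (level - 1)
    p_all = q ** level
    return (1.0 - coverage) * p_all_but_one + p_all


def required_unreliability(level: int, target: float = 5.0e-5,
                           coverage: float = 0.85) -> float:
    """Solve system_unreliability(q) = target for q by bisection on [0, 1].
    The left-hand side increases with q for the levels considered here."""
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if system_unreliability(mid, level, coverage) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for level, name in ((1, "simplex"), (2, "duplex"), (3, "triplex"), (4, "quadruplex")):
    print(f"{name:10s} Q = {required_unreliability(level):.3e}")
```

Under these assumptions the computed values come out close to the requirements quoted in the summary.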

The reliability analyses in chapter 3 will utilize a range of
component MTTFs. Processor, input/output, and memory elements possess
equal MTTFs ranging from 10^[exponent illegible] to 10^[exponent
illegible] hours. Network elements possess a MTTF ten times greater,
ranging from 10^[exponent illegible] to 10^[exponent illegible] hours.
Figure 2.2-1 graphically depicts the equivalent reliability of cluster
components when their MTTFs are taken over a one year period. The X-axis
scale represents the MTTF of the processor, input/output, and memory
elements and the MTTF of the network elements divided by ten. The X-axis
scaling of MTTF will represent the same for all succeeding graphs in the
thesis. Figure 2.2-1 may be used as a reference to translate the often
used quantity of MTTF to the more meaningful quantity of reliability.
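
Because every component is assumed to have an exponential failure distribution, the translation from MTTF to one-year reliability is R = exp(-t/MTTF). The sketch below illustrates this; the 8760-hour year and the sample MTTF values are assumptions made here, not figures from the thesis.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: 365-day year


def reliability(mttf_hours: float, mission_hours: float = HOURS_PER_YEAR) -> float:
    """Reliability of a component with a constant failure rate of
    1/MTTF over a mission of the given length."""
    return math.exp(-mission_hours / mttf_hours)


# Hypothetical MTTF values, for illustration only.
for mttf in (1.0e4, 2.5e4, 1.0e5):
    print(f"MTTF = {mttf:9.0f} h  ->  R(1 year) = {reliability(mttf):.4f}")
```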
Figure 2.2-1: Component Reliability vs. Component MTTF.

2.3 BASELINE ARCHITECTURE

The following assumptions are made to define a baseline architecture:

1. There are three processor elements per network element.

2. There is one shared memory element for every three processor
elements. Each memory element stores both global and regional
memory.

3. Within a cluster, there are at least three processor elements
assigned to a single task. There are at least four processor
elements assigned to the global controller.

4. Clusters are fully linked (every cluster is connected to every
other cluster).

5. A set of processor elements performing n tasks possesses a
throughput n times as fast as a set of processors performing a single
task (ideal behavior).

The baseline cluster architecture (figure 2.3-1) therefore utilizes:

1. 5 Input/Output Elements.

2. 5 Network Elements.

3. 15 Processor Elements.

a. 12 Working Processor Elements. (The cluster designated
global controller would utilize 8 working processor elements and 4
global controllers).

b. 3 Cluster Controllers.

4. 5 Memory Elements.


Figure 2.3-1: Baseline Cluster Architecture. [Labels: memory element;
processor element; network element; input/output element (to other
clusters).]


Supporting the 100 MIPS requirement with 5 MIPS processor elements
requires 20 sets of processors operating in parallel. A network of 6
baseline clusters (figure 2.3-2) will therefore meet the application
throughput requirement, leaving 8 processor elements as additional spares.
The baseline system performs a total of 27 tasks: 20 computational tasks,
6 cluster controller tasks, and 1 global controller task.


Cluster 1: 3 CC / 12 WPE (5 tasks)
Cluster 2: 3 CC / 12 WPE (5 tasks)
Cluster 3: 3 CC / 12 WPE (5 tasks)
Cluster 4: 3 CC / 12 WPE (4 tasks)
Cluster 5: 3 CC / 12 WPE (4 tasks)
Cluster 6: 4 GC / 3 CC / 8 WPE (4 tasks)

Figure 2.3-2: Baseline System Architecture.
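
The sizing arithmetic behind the baseline system can be checked with a short sketch. It assumes one cluster controller task per cluster, one global controller task in Cluster 6, and three working processor elements per computational task; the dictionary layout and variable names are illustrative only.

```python
import math

REQUIRED_MIPS = 100
PE_MIPS = 5
PES_PER_TASK = 3  # minimum processors assigned to a single task

# cluster id -> (global controllers, working PEs, tasks assigned)
baseline = {
    1: (0, 12, 5),
    2: (0, 12, 5),
    3: (0, 12, 5),
    4: (0, 12, 4),
    5: (0, 12, 4),
    6: (4, 8, 4),
}

processor_sets = math.ceil(REQUIRED_MIPS / PE_MIPS)

total_tasks = sum(tasks for _, _, tasks in baseline.values())
cc_tasks = len(baseline)  # assumed: one cluster controller task per cluster
gc_tasks = sum(1 for gc, _, _ in baseline.values() if gc > 0)
computational_tasks = total_tasks - cc_tasks - gc_tasks

total_wpe = sum(wpe for _, wpe, _ in baseline.values())
spare_wpe = total_wpe - processor_sets * PES_PER_TASK

print(processor_sets, total_tasks, computational_tasks, spare_wpe)
```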


CHAPTER  3 


RELIABILITY  MODELLING 


3.1  BASELINE  RELIABILITY 


The  purpose  of  this  section  is  to  develop  a  FTPP  reliability  model 
which  generates  the  FTPP  system  reliability  as  a  function  of  the  FTPP 
component  reliabilities.  The  reliability  model  will  be  used  to  generate 
an  upper  and  a  lower  unreliability  bound  for  the  baseline  architecture 
defined  in  Chapter  2.  For  the  lower  unreliability  bound,  any  assumptions 
made  generate  the  highest  possible  reliability.  For  the  upper 
unreliability  bound,  any  assumptions  made  generate  the  highest  possible 
unreliability.  The  one  exception  to  these  rules  is  that  cluster 
isolations  due  to  combinations  of  failures  between  network  elements  and 
input/output  elements  of  different  clusters  are  neglected  for  both  upper 
and  lower  bound  calculations.  This  exception  has  little  effect  on 
reliability  calculations,  as  will  be  demonstrated  presently,  and  permits 
the  clusters  to  be  treated  as  independent  units.  Both  upper  and  lower 
unreliability  bounds  will  be  generated  using  combinatorial  and 
decomposition  techniques. 


Figure 3.1-1: Baseline Architecture as Viewed from Cluster 1.


modes which would isolate a cluster would primarily involve the elements
of that particular cluster. The isolation of a cluster due to failures in
other clusters would require a seemingly unlikely combination of failures
in all other clusters. This observation is applicable for a cluster
connected to a 'significant' number of clusters and may not be
particularly applicable for a ring topology where every cluster is
connected to only two other clusters. The following analysis calculates
the probability of cluster isolation, due to input/output element
failures, in two ways. First: the probability that a single cluster
experiences an isolation is calculated neglecting isolations due to
input/output element failures in other clusters and multiplying the
reliabilities of the six clusters. Since some failure modes are
neglected, the calculated unreliability will be lower than the actual.
Second: the probability that a cluster experiences an isolation is
calculated by taking into account all possible failure modes between
clusters and multiplying the reliabilities of the six clusters. Since
some failure modes will be counted more than once, the calculated
unreliability will be higher than the actual. If the two methods yield
similar results, the first method can be employed to simplify the system
reliability model without significantly altering the results.

Using the first method: the probability of at least one isolation,
P(ISOLATION), is the complement of the probability of no isolations, P(0
ISOLATIONS):

$$P(\text{ISOLATION}) = 1 - P(\text{0 ISOLATIONS}) \tag{3.1-1}$$

The probability of no isolations is calculated by multiplying the
reliabilities of the six clusters:

$$P(\text{0 ISOLATIONS}) = \left[ 1 - P(\text{1 CLUSTER ISOLATION}) \right]^{6} \tag{3.1-2}$$

The probability a particular cluster experiences an isolation, P(1 CLUSTER
ISOLATION), is the probability 4 or 5 input/output elements fail on that
cluster since other cluster failures are being neglected:

$$P(\text{1 CLUSTER ISOLATION}) = (1 - R_{IO})^{5} + 5(1 - R_{IO})^{4} R_{IO} \tag{3.1-3}$$

where: $R_{IO}$ = Reliability of a single input/output element.


The probability of at least one cluster isolation is now completely
defined as a function of the reliability of the input/output elements.

Using the second method: the probability of at least one isolation
must take into account all failure modes. Equations 3.1-1 and 3.1-2
remain valid. To find the probability a particular cluster experiences an
isolation, a decomposition on the input/output elements of that cluster is
performed:

$$P(\text{1 CLUSTER ISOLATION}) = \sum_{N=0}^{5} \left[ P(\text{1 CLUSTER ISOLATION} / N \text{ IO WORK}) \times P(N \text{ IO WORK}) \right] \tag{3.1-4}$$

where:

P(N IO WORK) = Probability exactly N out of 5 input/output elements
operate.

P(1 CLUSTER ISOLATION/N IO WORK) = Probability a particular cluster
is isolated given that exactly N input/output elements
operate.

The probability exactly N input/output elements out of 5 operate can be
calculated combinatorially:

$$P(N \text{ IO WORK}) = \binom{5}{N} (1 - R_{IO})^{5-N} (R_{IO})^{N} ; \quad N = 0 \text{ to } 5 \tag{3.1-5}$$

where:

$\binom{5}{N}$ = Binomial coefficient calculated as $5!/[(5-N)!\,N!]$.

The probability a particular cluster is isolated given that exactly N
input/output elements operate is calculated using the knowledge that a
cluster must have at least two operational lines between itself and at
least one other cluster. A single communication line between two clusters
consists of 2 input/output elements, both of which must be operational for
a communication line to be declared operational. Therefore:

$$P(\text{1 CLUSTER ISOLATION} / N \text{ IO WORK}) = P(N-1 \text{ or } N \text{ out of } N \text{ IO of all other clusters fail}) \tag{3.1-6}$$

where:

$$P(N-1 \text{ or } N \text{ out of } N \text{ IO of all other clusters fail}) =
\begin{cases}
1 ; & N = 0, 1 \\
\left[ \sum_{i=0}^{1} \binom{N}{i} (1 - R_{IO})^{N-i} (R_{IO})^{i} \right]^{5} ; & N = 2 \text{ to } 5
\end{cases} \tag{3.1-7}$$

The  probability  of  at  least  one  cluster  isolation  is  now  completely 
defined  as  a  function  of  the  reliability  of  the  input/output  elements. 

The probability of at least one cluster isolation was programmed in
FORTRAN using both methods. A comparison of the two methods for
input/output element base MTTFs ranging from 10^[exponent illegible] to
10^5 hours is depicted graphically in figure 3.1-2. The base MTTF is the
MTTF of an individual input/output element with no communication lines
emanating from it. The graph shows a maximum error of about 2.3 percent
occurring at input/output element MTTF = 25000 hours and subsequently
monotonically decreasing to about .1 percent for MTTF = 100000 hours. The
system reliability model generated in the following section will
therefore assume that the probability that a particular cluster
experiences an isolation is due exclusively to failures of its own
elements. Clusters are in effect assumed independent for the fully linked
system.

Figure 3.1-2: Percent Error Incurred Assuming Cluster
Independence for Baseline System.
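
The two isolation calculations can be sketched in Python as follows. This is an illustration only and not the FORTRAN program used in the thesis. It takes the input/output element reliability directly as an argument, assumes six fully linked clusters with five input/output elements each, and follows one reading of equations 3.1-3 through 3.1-7. The function names are hypothetical.

```python
from math import comb


def p_isolation_method1(r_io: float, clusters: int = 6, ios: int = 5) -> float:
    """First method: only a cluster's own input/output failures count."""
    q = 1.0 - r_io
    # Eq. 3.1-3: all, or all but one, of the cluster's IO elements fail.
    p_one = q ** ios + ios * q ** (ios - 1) * r_io
    return 1.0 - (1.0 - p_one) ** clusters  # eqs. 3.1-1 and 3.1-2


def p_isolation_method2(r_io: float, clusters: int = 6, ios: int = 5) -> float:
    """Second method: decomposition on the cluster's own IO elements,
    including failures of the far-end IO elements in other clusters."""
    q = 1.0 - r_io
    p_one = 0.0
    for n in range(ios + 1):
        p_n_work = comb(ios, n) * q ** (ios - n) * r_io ** n  # eq. 3.1-5
        if n < 2:
            p_given = 1.0
        else:
            # Fewer than two of the n far-end IO elements work in a given
            # other cluster; this must hold for every other cluster.
            p_few = q ** n + n * q ** (n - 1) * r_io
            p_given = p_few ** (clusters - 1)
        p_one += p_given * p_n_work  # eq. 3.1-4
    return 1.0 - (1.0 - p_one) ** clusters


# Hypothetical reliabilities, for illustration only.
for r in (0.70, 0.90, 0.99):
    m1, m2 = p_isolation_method1(r), p_isolation_method2(r)
    print(f"R_IO={r:.2f}  method 1={m1:.6e}  method 2={m2:.6e}")
```

The terms for N = 0 and N = 1 in the second method reproduce the first method exactly, so the second result is never smaller than the first, in line with the bounds argument in the text.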

3.1.2  System  Reliability  Model  for  Baseline 

The general procedure employed to derive a system reliability model
is to find the reliability of a single cluster by decomposing on the
network and memory elements. The reliability of the total system is found
using the calculated cluster reliabilities and the assumption of cluster
independence. The following derivation will calculate the system
reliability in terms of the component reliabilities.

The probability that the FTPP system operates, P(S), is given by the
following equation:

$$P(S) = P(C_1) \times P(C_2/C_1) \times P(C_3/C_1 \cap C_2) \times \cdots \times P(C_6/C_1 \cap C_2 \cap \cdots \cap C_5) \tag{3.1-8}$$

where:

$P(C_X)$ = Probability cluster X operates.

$P(C_X/C_A \cap \cdots \cap C_E)$ = Probability cluster X operates given that
clusters A through E operate.

Since cluster failures are assumed to be independent, the system
reliability simply becomes the product of the individual cluster
reliabilities:

$$P(S) = P(C_1) \times P(C_2) \times P(C_3) \times P(C_4) \times P(C_5) \times P(C_6) \tag{3.1-9}$$

$$P(S) = \prod_{i=1}^{6} P(C_i) \tag{3.1-10}$$

To find the probability a cluster operates, P(C), a first decomposition is
performed on the network elements:

$$P(C) = \sum_{X=0}^{5} \left[ P(C / X \text{ NE WORK}) \times P(X \text{ NE WORK}) \right] \tag{3.1-11}$$

where:

P(X NE WORK) = Probability exactly X out of 5 network elements
operate.

P(C/X NE WORK) = Probability a cluster operates given that exactly
X network elements operate.

P(X NE WORK) can be calculated combinatorially using equation 3.1-5 and
substituting RN and X for RIO and N respectively, where RN is the
reliability of a single network element. The probability that a cluster
operates, given that exactly X network elements operate, is equal to the


probability input/output elements do not cause cluster failure, given that
X network elements operate, and memory or processor elements do not cause
cluster failure, given that X network elements operate. Since these two
failure modes are independent, P(C/X NE WORK) is equal to the product of
the reliability of the input/output elements, given that X network
elements operate, and the reliability of the processor and memory
elements, given that X network elements operate.

$$P(C / X \text{ NE WORK}) = P(IO / X \text{ NE WORK}) \times P(\text{PE and ME} / X \text{ NE WORK}) \tag{3.1-12}$$

where:

P(IO/X NE WORK) = Probability input/output failures do not cause
cluster failure given that X network elements operate.

P(PE and ME/X NE WORK) = Probability processor element or memory
element failures do not cause cluster failure given that X
network elements operate.

P(IO/X NE WORK) is the probability that at least two input/output elements
have access to working network elements and can be calculated
combinatorially:

$$P(IO / X \text{ NE WORK}) = P(\text{at least 2 of } X \text{ IO WORK}) \tag{3.1-13}$$

where:

$$P(\text{at least 2 of } X \text{ IO WORK}) =
\begin{cases}
0 ; & X = 0, 1 \\
\sum_{i=0}^{X-2} \binom{X}{i} (1 - R_{IO})^{i} (R_{IO})^{X-i} ; & X = 2 \text{ to } 5
\end{cases} \tag{3.1-14}$$

where: $R_{IO}$ = Reliability of a single input/output element.

P(PE and ME/X NE WORK) can be found by decomposing on the memory elements:

$$P(\text{PE and ME} / X \text{ NE WORK}) = \sum_{Y=0}^{5} \left[ P(\text{PE and ME} / Y \text{ ME and } X \text{ NE WORK}) \times P(Y \text{ ME WORK}) \right] \tag{3.1-15}$$

where:

P(Y ME WORK) = Probability that exactly Y out of 5 memory elements
operate.

P(PE and ME/Y ME and X NE WORK) = Probability that processor
element or that memory element failures do not cause cluster
loss given that Y memory elements and X network elements
operate.

P(Y ME WORK) can be calculated combinatorially using equation 3.1-5 and
substituting RM and Y for RIO and N respectively, where RM is the
reliability of a single memory element. Finding P(PE and ME/Y ME and
X NE WORK) requires assumptions to be made regarding the behavior of the
processor elements. For the lower unreliability bound, idealistic
assumptions are made: processors within a cluster may perform any cluster
task and switch tasks instantaneously with all failures covered. For the
upper unreliability bound, pessimistic assumptions are made: each
processor may perform only the function it was initially assigned, and
failures are covered 85 percent of the time for duplex configurations and
100 percent of the time for triplex configurations and higher. These
bounds were chosen so that practical systems fall between both bounds. At
this point, the number of tasks a cluster is assigned to perform becomes
this  point,  the  number  of  tasks  a  cluster  is  assigned  to  perform  becomes 


an  issue.  In  the  baseline  architecture,  3  clusters  are  assigned  4  tasks, 
and  3  clusters  are  assigned  5  tasks. 

For the lower unreliability bound, at least 4 processors must operate
for  a  4  task  cluster  and  at  least  5  processors  must  operate  for  a  5  task 
cluster  for  the  cluster  to  be  declared  operational.  For  a  4  task  cluster: 
if  all  network  elements  and  memory  elements  worked  for  the  entire  year, 
then  4  processors  out  of  15  must  survive.  If  all  network  elements  worked 
and  1  memory  element  failed  during  the  year,  then  3  processors  are 
rendered  useless  and  4  out  of  the  remaining  12  must  survive.  In  some 
cases,  the  number  of  processors  that  must  survive  must  be  given  in 
probabilistic  terms.  For  example:  if  1  network  element  and  one  memory 
element  failed,  4  processors  out  of  either  9  or  12  processors  must  survive 
depending  upon  which  particular  elements  failed.  In  this  case,  4  out  of 
12  must  survive  20  percent  of  the  time,  and  4  out  of  9  must  survive  80 
percent  of  the  time.  It  is  in  this  manner  the  following  is  computed: 


P(PE and ME/(0 or 1 ME) or (0 or 1 NE) WORK) = 0
P(PE and ME/2 ME and 2 NE) = .1P(T of 6 PE WORK)
P(PE and ME/3 ME and 2 NE) = .3P(T of 6 PE WORK)
P(PE and ME/4 ME and 2 NE) = .6P(T of 6 PE WORK)
P(PE and ME/5 ME and 2 NE) = P(T of 6 PE WORK)
P(PE and ME/2 ME and 3 NE) = .3P(T of 6 PE WORK)
P(PE and ME/3 ME and 3 NE) = .6P(T of 6 PE WORK) + .1P(T of 9 PE WORK)
P(PE and ME/4 ME and 3 NE) = .6P(T of 6 PE WORK) + .4P(T of 9 PE WORK)
P(PE and ME/5 ME and 3 NE) = P(T of 9 PE WORK)
P(PE and ME/2 ME and 4 NE) = .6P(T of 6 PE WORK)
P(PE and ME/3 ME and 4 NE) = .6P(T of 6 PE WORK) + .4P(T of 9 PE WORK)
P(PE and ME/4 ME and 4 NE) = .8P(T of 9 PE WORK) + .2P(T of 12 PE WORK)
P(PE and ME/5 ME and 4 NE) = P(T of 12 PE WORK)
P(PE and ME/2 ME and 5 NE) = P(T of 6 PE WORK)
P(PE and ME/3 ME and 5 NE) = P(T of 9 PE WORK)
P(PE and ME/4 ME and 5 NE) = P(T of 12 PE WORK)
P(PE and ME/5 ME and 5 NE) = P(T of 15 PE WORK)    (3.1-16)

where:

T = Number of tasks a cluster is assigned.

P(T of Z PE WORK) = Probability at least T processors out of Z
operate.

P(T of Z) can be calculated combinatorially:

P(T of Z PE WORK) = 1 - [\sum_{i=0}^{T-1} \binom{Z}{i} (1-RP)^{Z-i} (RP)^{i}]    (3.1-17)

where: RP = Reliability of a single processor element.
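
The lower bound bookkeeping of equation 3.1-16 can be sketched as a lookup table in Python. The names `WEIGHTS`, `p_t_of_z`, and `p_pe_me` are hypothetical; the weights are transcribed from the list in equation 3.1-16, and independent processors with a common reliability RP are assumed.

```python
from math import comb


def p_t_of_z(t: int, z: int, rp: float) -> float:
    """Probability that at least t of z processors operate (complement of
    the probability that fewer than t operate)."""
    return 1.0 - sum(comb(z, i) * (1 - rp) ** (z - i) * rp ** i
                     for i in range(t))


# (working ME, working NE) -> list of (weight, usable processors), eq. 3.1-16
WEIGHTS = {
    (2, 2): [(0.1, 6)], (3, 2): [(0.3, 6)],
    (4, 2): [(0.6, 6)], (5, 2): [(1.0, 6)],
    (2, 3): [(0.3, 6)], (3, 3): [(0.6, 6), (0.1, 9)],
    (4, 3): [(0.6, 6), (0.4, 9)], (5, 3): [(1.0, 9)],
    (2, 4): [(0.6, 6)], (3, 4): [(0.6, 6), (0.4, 9)],
    (4, 4): [(0.8, 9), (0.2, 12)], (5, 4): [(1.0, 12)],
    (2, 5): [(1.0, 6)], (3, 5): [(1.0, 9)],
    (4, 5): [(1.0, 12)], (5, 5): [(1.0, 15)],
}


def p_pe_me(y: int, x: int, t: int, rp: float) -> float:
    """P(PE and ME / Y ME and X NE WORK) for a t-task cluster (lower bound).

    Any state with fewer than 2 working ME or NE is absent from the table
    and therefore contributes zero.
    """
    return sum(w * p_t_of_z(t, z, rp) for w, z in WEIGHTS.get((y, x), []))
```

The entry for 4 ME and 4 NE reproduces the example in the text: 4 of 9 processors must survive 80 percent of the time and 4 of 12 the remaining 20 percent.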

For the upper unreliability bound, at least 3 processors must operate
for a 3 task cluster and at least 4 processors must operate for a 4 task
cluster. Since upper bound calculations assume processors perform only
the functions they were initially assigned, a 4 task cluster would
delegate tasks as depicted in figure 3.1-3.

Figure 3.1-3: Delegation of Processors for a 4 Task Cluster.

Each task is distributed among different network and memory elements to
avoid single point failures. Three tasks would be controlled by
quadruplexes of processors and one task would be controlled by a triplex
of processors. If all network elements and memory elements worked for the
entire year, then 1 processor from each quadruplex and 1 processor from
the triplex must survive. As with the lower bound, some cases must be
described probabilistically. If one network element and 1 memory element
failed during the year, they are assumed to have failed at the beginning
of the year. This assumption simplifies the model and tends to make the
calculated unreliability higher than the actual unreliability, thus
maintaining the integrity of the upper bound. In this case, 1 triplex of
processors and 3 duplexes of processors would remain in 56 percent of the
cases. Two triplexes, 1 duplex, and 1 simplex would remain in 24 percent
of the cases. One quadruplex, 2 triplexes and 1 duplex would remain in 12
percent of the cases and 4 triplexes would remain in 8 percent of the
cases. The P(PE and ME/Y ME and X NE) equations (3.1-16) derived for the
lower bound hold true, but P(T of X PE) is redefined as follows:


P(4 of 6 PE)  = .9(RD^{2} x RS^{2})
P(4 of 9 PE)  = .3(RT^{2} x RD x RS) + .7(RT x RD^{3})
P(4 of 12 PE) = .6(RQ x RT^{2} x RD) + .4RT^{4}
P(4 of 15 PE) = RQ^{3} x RT
P(5 of 6 PE)  = .3(RD x RS^{4})
P(5 of 9 PE)  = .5(RT x RD^{2} x RS^{2}) + .5(RD^{4} x RS)
P(5 of 12 PE) = RT^{2} x RD^{3}
P(5 of 15 PE) = RT^{5}    (3.1-18)


where:

RS = the reliability of a single processor:

RS = RP    (3.1-19)

RD = the reliability of a duplex of processors:

RD = 1 - [(1-RP)^{2} + (1-C) 2RP(1-RP)]    (3.1-20)

RT = the reliability of a triplex of processors:

RT = 1 - [(1-RP)^{3} + (1-C) 3RP(1-RP)^{2}]    (3.1-21)

RQ = the reliability of a quadruplex of processors:

RQ = 1 - [(1-RP)^{4} + (1-C) 4RP(1-RP)^{3}]    (3.1-22)

C = Coverage of a duplex of processors = .85
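
As a rough illustration, the group reliabilities of equations 3.1-19 to 3.1-22 share one pattern that can be written as a single Python function. The names `r_group` and `p_4_of_9_upper` are hypothetical. The 0.3 and 0.7 weights are an interpretation of the 24 and 56 percent cases quoted in the text, renormalized over the 80 percent of cases in which 9 processors remain.

```python
C = 0.85  # coverage of a duplex of processors, as stated in the text


def r_group(m: int, rp: float, c: float = C) -> float:
    """Reliability of a group of m processors assigned to one task.

    m = 1 is a simplex (RS = RP). For m = 2, 3, 4 the group fails if all
    m processors fail, or if it degrades to its last processor through an
    uncovered failure (the (1-C) term).
    """
    if m == 1:
        return rp
    q = 1.0 - rp
    return 1.0 - (q ** m + (1.0 - c) * m * rp * q ** (m - 1))


def p_4_of_9_upper(rp: float) -> float:
    """Upper bound case: 4 tasks, 9 usable processors.

    30 percent: 2 triplexes, 1 duplex, 1 simplex.
    70 percent: 1 triplex, 3 duplexes.
    """
    rs, rd, rt = r_group(1, rp), r_group(2, rp), r_group(3, rp)
    return 0.3 * (rt ** 2 * rd * rs) + 0.7 * (rt * rd ** 3)
```

For RP = 0.9 the duplex reliability is 1 - (0.01 + 0.15 x 2 x 0.9 x 0.1) = 0.963, which shows how the uncovered-failure term dominates the duplex unreliability.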
The reliability models for both upper and lower bounds were
programmed using FORTRAN. The baseline FTPP unreliability as a function
of component MTTFs is depicted graphically in figure 3.1-4. These bounds
are compared to the required unreliability bounds derived in section 2.2
for various FTPP redundancy levels. Clearly, the baseline architecture is
undesirable in terms of the application reliability requirement. Only the
most idealistic processor behavior (lower bound assumptions) and
optimistic MTTFs would support even a quadruplex computer system. Also
note that the baseline unreliability bounds are not straight lines, as
would be expected in a system with an exponentially distributed failure
time, since the graph plots the log of unreliability. Since there is
parallelism involved in the architecture, the total system need not be
exponentially distributed, even though all components of the FTPP are
assumed to be exponentially distributed [6]. After an MTTF of
approximately 35000 hours, however, both curves are relatively straight
and could be approximated exponentially. The following sections in this
chapter explore methods to further lower the unreliability curve.

Figure 3.1-4: Baseline System Lower and Upper Unreliability
Bounds.

3.2 INTRA-CLUSTER MODIFICATION OPTIONS


This section will address modifications at the cluster level and
their effect on cluster reliability. Section 3.2.1 examines the
sensitivity of the cluster reliability to changes in component MTTF. The
remainder of the sections focus primarily on the processor elements, which
represent the most flexible components of the FTPP in terms of
architectural and redundancy strategy modification. Section 3.2.2
examines modification of the number of processor elements per network and
memory element. Section 3.2.3 examines the implications of assigning
different numbers of tasks to the processors of a cluster. Section 3.2.4
examines the trade offs between reliability and throughput inherent in
parallel processors and section 3.2.5 examines reliability bottlenecks.

3.2.1 Cluster Reliability Sensitivities

Cluster reliability sensitivities determine which component
improvements provide the highest payoff. Using the reliability model
developed in section 3.1, the MTTF of the four components was set constant
at MTTF = 50000 hours. Each element's MTTF was then varied individually
from 10000 to 100000 hours. The results of this exercise using both lower
and upper unreliability assumptions on a baseline cluster are depicted
graphically in figures 3.2-1 and 3.2-2 respectively.

Figure 3.2-1: Cluster Reliability Sensitivities
(Lower Bound Assumptions).

Figure 3.2-2: Cluster Reliability Sensitivities
(Upper Bound Assumptions).

Using lower bound assumptions, cluster reliability is most sensitive to
changes in the MTTF of network elements followed by memory, processor,
and input/output elements. Using upper bound assumptions, cluster
reliability is most sensitive to changes in the MTTF of processor
elements followed by network, memory, and input/output elements.

In both lower and upper bound cases, network elements figure
prominently, and input/output elements figure least prominently, in
determining cluster reliability. Loss of a network element renders three
processor elements, one memory element, and one input/output element
useless. Because of the dispersion of processor tasks discussed earlier,
the loss of any two network elements will never by itself cause cluster
failure. Loss of an input/output element, on the other hand, has no
effect on the usefulness of other cluster components. Using upper bound
assumptions, processor elements jump to most prominent status in
determining cluster reliability, because the loss of only two processors
may cause task loss and, therefore, cluster failure. In both cases,
memory elements fall in the middle. Loss of a memory element renders
three processor elements useless. Once again, because of the dispersion
of processor tasks, the loss of any two memory elements will never by
itself cause cluster failure.

3.2.2 Cluster Architecture

The baseline cluster architecture utilized three processors for each
network and memory element, and the general FTPP architecture allows
modification of this number. Modifying the number of processors results
in two opposing effects on cluster reliability. As more processors are
added to a cluster, the reliability of the processors to perform the
assigned tasks increases. In order to support this modification, the
complexity of the network and memory elements increases, thus decreasing
their reliability. Whichever effect dominates under the existing
conditions determines the optimum number of processors per network and
memory element. To examine these effects analytically, consider a system
consisting of one memory element, one network element, and N processor
elements, where N is the number of processors per network and memory
element. The reliability of this trio of elements (RT) is the product of
their reliabilities because all three must operate for a single task to be
accomplished:

RT = RME_{N} x RNE_{N} x RPE_{N}    (3.2-1)

where:

RME_{N} = The reliability of a memory element with N connections and
failure rate L.

RME_{N} = e^{-.1NLt}    (3.2-2)

RNE_{N} = The reliability of a network element with N connections and
failure rate .1L.

RNE_{N} = e^{-.1N(.1L)t}    (3.2-3)

RPE_{N} = The reliability of a set of N processors, each with failure
rate L, to perform a task. The unreliability of each
processor (Q) equals 1-e^{-Lt}. The coverage of a duplex of
processors is C and the coverage of triplexes and above is 1.

RPE_{N} = 1 - [(Q)^{N} + N(1-C)(Q)^{N-1}(1-Q)]    (3.2-4)

The optimum N is the N which maximizes the reliability of the trio. This
N can be computed by taking the derivative of RT with respect to N,
setting the result equal to zero, and solving for N:

d(RT)/dN = 0 :  X - Q^{N} [X + ln(Q) + Q^{-1}(1-Q)(1-C)(1 + XN + N ln(Q))] = 0    (3.2-5)

where: X = -.11Lt

Equation 3.2-5 cannot be solved for N explicitly. The optimum number of
processors was derived numerically for C = .85 and t = 8760 hours. Results
are depicted graphically in figure 3.2-3 for a range of component MTTFs.
The optimum number of processor elements per network and memory element
decreases as component MTTF increases. For the values chosen, the optimum
number of processors is four for component MTTFs below 20000 hours (200000
hours for network elements) and three for MTTFs above 20000 hours.
Therefore, the baseline cluster architecture is optimal under the defined
architectural constraints when reasonably reliable components are used.
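
The numerical search described above can be sketched in Python by evaluating the trio reliability directly for integer N rather than solving equation 3.2-5. This is a sketch under stated assumptions, not the thesis's procedure: the names `trio_reliability` and `optimum_n` are hypothetical, and the combined memory and network term is assumed to be e^{XN} with X = -.11Lt, which is a reading of the source rather than a confirmed value.

```python
from math import exp


def trio_reliability(n: int, mttf: float, c: float = 0.85,
                     t: float = 8760.0, x_coeff: float = 0.11) -> float:
    """Reliability of one memory element, one network element, and n
    processors, with L = 1/MTTF.

    ASSUMPTION: memory and network reliabilities combine to
    exp(-x_coeff * n * L * t), with x_coeff = 0.11.
    """
    lt = t / mttf
    q = 1.0 - exp(-lt)  # unreliability of a single processor
    r_pe = 1.0 - (q ** n + n * (1.0 - c) * q ** (n - 1) * (1.0 - q))
    return exp(-x_coeff * lt * n) * r_pe


def optimum_n(mttf: float, candidates=range(2, 11)) -> int:
    """Integer n in `candidates` that maximizes the trio reliability."""
    return max(candidates, key=lambda n: trio_reliability(n, mttf))
```

Under these assumptions the search gives three processors at an MTTF of 50000 hours and four at 15000 hours, consistent with the trend reported in the text.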

Figure 3.2-3: Optimum Number of Processors per Single Network/
Memory Element vs. Component MTTF.


3.2.3 Cluster Task Assignment


Assigning fewer tasks per cluster increases cluster reliability, but
at a cost which can be paid in two ways by the total system. First: in
order to make up for the decrease in tasks per cluster, more clusters can
be added to the system. In this case, the probability of a cluster
isolation increases since more clusters are being introduced in the
system. Second: the same number of clusters may remain in the system. In
this case, the total system taskload and, therefore, throughput decreases.
In a parallel processor, throughput and reliability are interchangeable
quantities. This trade off will be examined in more detail in the next
section (3.2.4).

Using the FTPP system model constructed for the fully linked baseline
system (section 3.1), cluster reliabilities for two, three, four, and five
task clusters were calculated for the assumptions of perfect and imperfect
processor coverage. Results for both lower and upper bound assumptions
are depicted graphically in figures 3.2-4 and 3.2-5 respectively. As
expected, cluster reliability increases as fewer tasks are assigned to the
cluster. The system effect of redistributing and reducing the number of
tasks per cluster by using additional clusters is examined in section

Figure 3.2-5: Cluster Unreliability vs. MTTF for an N Task Cluster
(Upper Bound Assumptions).

3.2.4 Reliability/Throughput Trade Off

Reliability may be traded for throughput in a parallel system. The
relationship may be fixed during the design process, or the trade off may
occur during operation. As an example of the latter: during critical
computations, the FTPP may switch to a lower throughput mode to achieve a
higher probability of success and switch back to the higher throughput
mode when the critical computations are completed.

For the purposes of this thesis, the FTPP always works one problem.
The term 'task' refers to the number of parallel computational paths that
the problem is broken up into. Each parallel computational path may be
performed by n processor elements. Figure 3.2-6 graphically depicts the
expected speedup versus the number of tasks operating in parallel for the
ideal case, with the lower and upper speedup bounds as defined by Hwang
and Briggs (described in section 1.1). The differences are rather large.
While 15 parallel tasks would ideally generate a speedup 15 times as fast,
the lower and upper speedup bounds are only about 4 and 5 times as fast.
The apparent jump in the graph for the n/ln(n) case at 2 tasks is an
anomaly of the equation which is clearly impossible and should be ignored.
Achieving a given speedup factor requires an extremely large number of
parallel tasks. On the other hand, the loss of a task would represent a
much smaller loss in throughput. The loss of a single task in an n task
parallel system would ideally represent a throughput loss (T_{loss}) of 1
times the throughput of a single processor (T_{sp}).

Figure 3.2-6: System Speedup vs. Number of Parallel Tasks.

Using the speedup estimates in section 1.1, the loss of a single task in
an n cluster system would represent a throughput loss defined by the
following equations:

T_{loss} = [log_{2}(n/(n-1))] T_{sp}  (Minsky's Conjecture)    (3.2-6)

OR

T_{loss} = [(n/ln(n)) - ((n-1)/ln(n-1))] T_{sp}  (Upper Bound)    (3.2-7)

For a 15 task system, the expected throughput loss would be (1)T_{sp} for
the ideal case and between (.1)T_{sp} and (.23)T_{sp} using equations 3.2-6
and 3.2-7 respectively. Depending on the actual application requirement,
the loss of at least one or more computational tasks should be tolerable.
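
The 15 task figures quoted above can be checked with a small Python sketch of equations 3.2-6 and 3.2-7. The function names are hypothetical, and losses are expressed in units of the single-processor throughput.

```python
from math import log, log2


def loss_minsky(n: int) -> float:
    """Eq. 3.2-6: throughput lost when an n task system drops to n-1
    tasks, using the log2(n) speedup estimate (Minsky's Conjecture)."""
    return log2(n / (n - 1))


def loss_upper(n: int) -> float:
    """Eq. 3.2-7: the same loss using the n/ln(n) speedup estimate.
    Only meaningful for n >= 3, since ln(1) = 0."""
    return n / log(n) - (n - 1) / log(n - 1)


if __name__ == "__main__":
    print(round(loss_minsky(15), 2), round(loss_upper(15), 2))  # 0.1 0.23
```

Both values are far below the ideal-case loss of one full processor, which is the basis for treating the loss of a computational task as tolerable.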

throughput is accompanied by a decrease in reliability. Using upper bound
processor assumptions, the cost of additional throughput in reliability
loss is greater than for the lower bound case. If the Hwang and Briggs
speedup bounds are used, the cost increases even more and both curves
would rise more sharply.

Power consumption is yet another term which may be added into the
reliability/throughput trade off. Processors may be turned off when power
consumption is critical at the cost of either reliability or throughput.
The act of turning components on and off in an operational system may be
risky in itself. Current operational spacecraft turn components on and
off only when absolutely necessary.

The system throughput, reliability and power consumption may be
'tuned' by the designer of a parallel processor to achieve a mix which
best meets the application requirement. The tuning is done by defining
and redefining the delegation of processors to the required tasks.

3.2.5 Reliability Bottlenecks

The previous section showed that the loss of a computational task in a
reasonably large parallel system is a tolerable event. This is not true
for the tasks of cluster and global controller. The loss of a cluster
controller results in cluster loss, and the loss of the global controller
results in system loss. A cluster can never be more reliable than its
cluster controller, and the system can never be more reliable than its
global controller. The tasks of cluster and global controller can
accurately be described as 'reliability bottlenecks' which deserve special
attention. In the baseline architecture, three processor elements were
reserved for each cluster controller, and four processor elements were
reserved for the global controller. In view of the relative importance of
tasks, it would be logical to transfer processor elements performing
computational tasks to the tasks of cluster or global controller should a
CC or GC experience degradation. Using this redundancy strategy
effectively means the controllers have access to unlimited spares. Figure
3.2-8 diagrams the Markov Model of the case of a cluster controller with
unlimited spares. While this strategy is certainly more reliable than a
simple triplex of processor elements, the Markov Model shows that there is
still the possibility of controller loss due to the finite reconfiguration
time of the processors. If a second failure occurs before the processors
can reconfigure, the controller may be lost.

Other  possible  reliability  bottlenecks  are  the  loss  of  a  memory  or 
network  element  causing  the  loss  of  a  task.  As  discussed  earlier,  the 
dispersion  of  tasks  between  different  memory  and  network  elements  can 
eliminate  single  point  failures  provided  that  the  cluster  is  not 
overloaded  with  tasks. 


L = PROCESSOR FAILURE RATE
R = PROCESSOR RECONFIGURATION RATE


3.3  INTER-CLUSTER  MODIFICATION  OPTIONS 


This section will address modifications at the system level and their
effects upon system performance. Section 3.3.1 examines the relative
performance of three selected cluster topologies. Section 3.3.2 examines
cluster redundancy strategies and their effect on system reliability.

3.3.1 Cluster Topologies

Cluster topology refers to the particular method employed to link
clusters. Cluster topology affects throughput, reliability,
maintainability, and modularity. Interconnection strategy is a key factor
in obtaining high performance, in reducing cost, and in keeping the system
feasible in terms of engineering [7].

Graph theory provides useful terminology in the analysis of cluster
topologies. Clusters are defined as nodes and the distance between any
pair of adjacent clusters is defined as 1. The diameter of the graph (k)
is defined as the maximum of the lengths of the shortest paths between any
two nodes. A graph of diameter k will take no more than k hops to travel
between any two nodes. The fan-out (d) is defined as the number of
connections emanating from each node, provided all nodes have the same
fan-out. One of the topologies examined in this thesis has a fan-out
which varies between clusters. A graph having n nodes, diameter k, and
fan-out d is defined as an (n,k,d) graph. Each graph can be redefined by
examining t link failures. Link failures may increase diameter [6,9].

The diameter of a FTPP system is a measure of the speed of the
network. As the diameter increases, so does the communications time
between certain nodes. An increase in communication time decreases system
throughput. The fan-out of a cluster is a measure of the complexity of a
cluster. As the fan-out increases, so does the complexity of the nodes.
An increase in nodal complexity translates to a decrease in nodal
reliability though not necessarily system reliability. The fan-out is
also a measure of the maintainability and modularity of a system. As the
fan-out increases, it becomes increasingly complicated to replace failed
clusters and to add new clusters.

For a given number of nodes, the most desirable graph would have a
diameter and fan-out of one. This is only possible for the case of n=2.
As n increases, the fan-out must increase to keep the diameter small or
the diameter must increase to keep the fan-out small. The following
section will analyze three cluster topologies in an attempt to work out
the various trade offs involved in selecting an interconnection strategy.
The three topologies to be examined include: centrally linked (star),
fully linked (fully cross-strapped), and singly linked (ring) - (figure
3.3-1).
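
The diameter and fan-out of the three topologies can be computed with a breadth-first search, as in the following Python sketch. The adjacency-set representation and the function names are illustrative choices, not part of the thesis; clusters are numbered from 0 and, for the star, cluster 0 is taken as the central cluster.

```python
from collections import deque


def star(n):
    """Centrally linked: node 0 is the central cluster."""
    adj = {i: {0} for i in range(1, n)}
    adj[0] = set(range(1, n))
    return adj


def complete(n):
    """Fully linked: every cluster is adjacent to every other."""
    return {i: set(range(n)) - {i} for i in range(n)}


def ring(n):
    """Singly linked: each cluster is adjacent to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def diameter(adj):
    """Maximum over all node pairs of the shortest path length (k).
    Assumes the graph is connected."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def max_fan_out(adj):
    """Largest number of connections emanating from any node (d)."""
    return max(len(neighbours) for neighbours in adj.values())
```

For n=6 this gives k=2, 1, and 3 for the star, fully linked, and ring graphs, with maximum fan-outs of 5, 5, and 2, matching the values quoted in sections 3.3.1.1 to 3.3.1.3.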

The reliability analysis in each of the following sections will
calculate the probability of at least one cluster isolation, P(ISOLATION),
for the different topologies looking exclusively at input/output elements.
This probability can be used to estimate the FTPP system reliability
(P(S)) using the results of section 3.1. In section 3.1, P(S) was derived
for a fully linked system. P(ISOLATION), looking exclusively at
input/output elements, was approximated by neglecting failures due to
input/output elements of other clusters and found to be relatively
accurate. Using these probabilities, the following equation holds true:

P(S) = (1-P(ISOLATION)) x RC^{n}    (3.3-1)

where:

RC = The reliability of a cluster neglecting the effects of the
input/output elements,
n = Number of clusters in the FTPP system.

Therefore, RC can be estimated as:

RC = [P(S)/(1-P(ISOLATION))]^{1/n}    (3.3-2)

RC will remain constant for different cluster topologies since topology
only affects input/output elements which are, by definition, independent
of RC. Knowing P(ISOLATION) for each topology, equation 3.3-2 can be used
to estimate RC and equation 3.3-1 used to calculate the total system
reliability. The following analysis will address the baseline case of n=6
specifically and then generalize for all n.
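
The two-step use of equations 3.3-1 and 3.3-2 can be sketched in Python as follows. The function names and the numbers in the usage lines are purely illustrative and are not results from the thesis.

```python
def system_reliability(p_isolation: float, rc: float, n: int) -> float:
    """Eq. 3.3-1: P(S) from the isolation probability and the cluster
    reliability RC (input/output elements excluded)."""
    return (1.0 - p_isolation) * rc ** n


def cluster_reliability(p_s: float, p_isolation: float, n: int) -> float:
    """Eq. 3.3-2: RC recovered from a known P(S) and P(ISOLATION)."""
    return (p_s / (1.0 - p_isolation)) ** (1.0 / n)


# Illustrative values only: estimate RC from the fully linked model,
# then reuse it with another topology's isolation probability.
rc = cluster_reliability(p_s=0.90, p_isolation=0.01, n=6)
p_s_other = system_reliability(p_isolation=0.05, rc=rc, n=6)
```

Because RC is held fixed, the ratio of the two system reliabilities depends only on the two isolation probabilities.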


Figure 3.3-1: Cluster Topologies.


3.3.1.1 Centrally Linked Topology

Figure 3.3-1(A) depicts the centrally linked topology of 6 clusters.
With no link failures (t=0): k=2, d=5 (for the central cluster), and d=1
(for the distributed clusters). The advantages of this topology are the
low diameter and the low fan-out of the distributed clusters. The
diameter remains two and the fan-out of the distributed clusters remains
one no matter how many clusters are added to the network. The
disadvantages of this topology are the low tolerance to link failures and
the high fan-out of the central cluster. Any one link failure will cause
an isolation, and the fan-out of the central cluster increases with the
number of clusters (d=n-1). While distributed clusters can be added and
replaced relatively easily, the central cluster is lacking in
maintainability.

The reliability of the centrally linked topology is relatively easy
to calculate by decomposing on the input/output elements of the central
cluster. Since any one link failure will cause a cluster isolation, the
probability of at least one isolation is the complement of the probability
of no link failures:

P(ISOLATION) = 1-[P(0 LINK FAILURES)]    (3.3-3)

P(0 LINK FAILURES) can be found by decomposing on the input/output
elements of the central cluster.

P(0 LINKS FAIL) = \sum_{N=0}^{5} [P(0 LINKS FAIL/N IO WORK) x P(N IO WORK)]    (3.3-4)

where:

P(N IO WORK) = Probability that exactly N out of 5 input/output
elements of the central cluster operate.

P(0 LINKS FAIL/N IO WORK) = Probability that no links fail given
that exactly N input/output elements of the central cluster
operate.

P(N IO WORK) can be calculated combinatorially using equation 3.1-5 and
substituting RIOC for RIO, where RIOC equals the reliability of a single
input/output element of the central cluster. The probability that no
links fail, given that exactly N input/output elements of the central
cluster operate, can be calculated by using the fact that every
distributed cluster must possess at least 2 operational lines to the
central cluster:

P(0 LINKS FAIL/N IO WORK) = [P(at least 2 of N IOD WORK)]^{5}    (3.3-5)

where:

IOD = Input/output element of a distributed cluster.

P(at least 2 of N IOD WORK) is found using equation 3.1-14 and
substituting RIOD and N for RIO and X respectively, where RIOD
equals the reliability of a single input/output element of a
distributed cluster.

To generalize for n, the exponent in equation 3.3-5 becomes (n-1).
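
Equations 3.3-3 to 3.3-5 can be combined into one short Python function. This is a sketch only: the function names are hypothetical, and equation 3.1-5 is assumed to be the standard binomial probability of exactly N working elements, which is consistent with how it is used here but is not shown in this section.

```python
from math import comb


def p_exactly(k: int, n: int, r: float) -> float:
    """Probability that exactly k of n independent elements operate
    (assumed form of eq. 3.1-5)."""
    return comb(n, k) * r ** k * (1.0 - r) ** (n - k)


def p_at_least_2(n: int, r: float) -> float:
    """Probability that at least 2 of n elements operate; zero for n < 2."""
    return sum(p_exactly(k, n, r) for k in range(2, n + 1))


def p_isolation_star(rioc: float, riod: float,
                     n_clusters: int = 6, n_io: int = 5) -> float:
    """P(ISOLATION) for the centrally linked topology.

    Decomposes on the number N of working input/output elements in the
    central cluster; each of the n_clusters-1 distributed clusters then
    needs at least 2 of N working input/output elements of its own.
    """
    p_no_link_fails = sum(
        p_at_least_2(big_n, riod) ** (n_clusters - 1)
        * p_exactly(big_n, n_io, rioc)
        for big_n in range(n_io + 1)
    )
    return 1.0 - p_no_link_fails
```

Setting `n_clusters` to another value applies the (n-1) exponent mentioned above while keeping 5 input/output elements per cluster.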




3.3.1.2 Fully Linked Topology

Figure 3.3-1(B) depicts the fully linked topology of 6 clusters which
was modelled in section 3.1. With no link failures (t=0): k=1 and d=5
for all clusters. For t=1 to t=4: k=2. Five link failures may cause
isolation depending on which links fail. The advantages of this topology
are the low diameter and the high tolerance to link failures. The
diameter remains one, and the tolerance to link failures increases as more
clusters are added to the network. A fully linked network can tolerate at
least n-2 link failures without an isolation. The major disadvantage of
this network is the high fan-out of the clusters. The fan-out increases
as more clusters are added to the system [d=n-1]. The network is clearly
lacking in the areas of maintainability and modularity.

While the complexity associated with a fully linked network makes an
exact reliability analysis difficult, the reliability can be bounded
relatively tightly as was shown in section 3.1.1. A lower unreliability
bound was generated by neglecting cluster isolations due to input/output
failures in other clusters and multiplying the reliabilities of all
clusters. An upper unreliability bound was generated by taking into
account all possible failure modes for each cluster and multiplying the
reliabilities of all clusters. The dominant failure mode for a particular
cluster isolation is the failure of that cluster's input/output elements,
which is why the reliability can be tightly bounded. This failure mode
becomes increasingly dominant as n increases.

3.3.1.3 Singly Linked Topology

Figure 3.3-1(C) depicts the singly linked topology of 6 clusters.
With no link failures (t=0): k=3 and d=2 for all clusters. The major
advantage of this topology is the low fan-out of the clusters. The
fan-out remains two no matter how many clusters are added to the network.
The disadvantages of this topology are the low tolerance to link failures
and the relatively high diameter. Any two link failures will always cause
an isolation, and the diameter increases as more clusters are added to the
network [k=n/2 (n even); k=(n-1)/2 (n odd)]. The network possesses both
high maintainability and modularity.

The reliability of the singly linked network is the most difficult to
analyze. Unlike the centrally linked architecture, the probability of an
isolation cannot be calculated using a simple decomposition. Unlike the
fully linked network, the isolation of a particular cluster is not
dominated by failures of its own input/output elements. The fact that
adjacent link failures are dependent must be taken into account. The
probability of an isolation can be calculated using combinatorial methods
and conditional probability.

Since any two link failures will cause an isolation, the probability
of at least one isolation, P(ISOLATION), is the complement of the
probability of zero or one link failure:

P(ISOLATION) = 1-[P(0 LINKS FAIL) + P(1 LINK FAILS)]    (3.3-6)

To calculate the probability of zero link failures, conditional
probability must be used. Link failures are dependent and each link's
probability of operating is conditional on the state of the other links:

$$P(0\ \text{LINKS FAIL}) = P(L_1) \times P(L_2 \mid L_1) \times P(L_3 \mid L_1 \cap L_2) \times P(L_4 \mid L_1 \cap L_2 \cap L_3) \times P(L_5 \mid L_1 \cap L_2 \cap L_3 \cap L_4) \times P(L_6 \mid L_1 \cap L_2 \cap L_3 \cap L_4 \cap L_5) \tag{3.3-7}$$

where:

$P(L_X \mid L_A \cap \dots \cap L_E)$ = Probability that link X operates given that
links A through E operate.

The probability link 1 operates is the probability that at least 2 lines
in the link operate. For a line to operate, the input/output elements on
both ends must function.

$$P(L_1) = 1 - \left[ (1 - RIO^2)^5 + 5\,(1 - RIO^2)^4\,RIO^2 \right] \tag{3.3-8}$$

where: RIO = Probability that a single input/output element
operates.
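Equation 3.3-8 can be checked numerically. The following is a minimal
Python sketch, assuming five lines per link (the exponent in the equation)
and treating each line as a series pair of input/output elements. The
function names are illustrative and are not taken from the original
programs.

```python
from math import comb

LINES_PER_LINK = 5  # assumption: five lines per link, as implied by eq. 3.3-8


def link_reliability(r_io: float) -> float:
    """P(L1) from eq. 3.3-8: at least 2 of the 5 lines operate."""
    line = r_io ** 2  # a line needs the I/O element at both of its ends
    p_zero = (1 - line) ** LINES_PER_LINK
    p_one = LINES_PER_LINK * (1 - line) ** (LINES_PER_LINK - 1) * line
    return 1 - (p_zero + p_one)


def link_reliability_by_sum(r_io: float) -> float:
    """Same quantity, summing the binomial terms for 2..5 working lines."""
    line = r_io ** 2
    return sum(
        comb(LINES_PER_LINK, k) * line ** k * (1 - line) ** (LINES_PER_LINK - k)
        for k in range(2, LINES_PER_LINK + 1)
    )
```

Both functions should agree for any RIO between 0 and 1, which is a quick
way to confirm that the complement form of the equation was transcribed
correctly.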

The probability that link 2 operates, given link 1 operates, can be
calculated using conditional probability:

$$P(L_2 \mid L_1) = P(L_1 \cap L_2) \,/\, P(L_1) \tag{3.3-9}$$

$P(L_1)$ has been calculated in equation 3.3-8. The probability that two
adjacent links operate, $P(L_1 \cap L_2)$, can be calculated by decomposing on
the input/output elements of the cluster the two links share (cluster 2):

$$P(L_1 \cap L_2) = \sum_{N=0}^{5} \left[ P(L_1 \cap L_2 \mid N\ \text{IO WORK}) \times P(N\ \text{IO WORK}) \right] \tag{3.3-10}$$

where:

$P(L_1 \cap L_2 \mid N\ \text{IO WORK})$ = Probability that L1 and L2 operate given
that exactly N input/output elements of the shared cluster operate.

P(N IO WORK) has been calculated in equation 3.1-5. $P(L_1 \cap L_2 \mid N\ \text{IO WORK})$
can be calculated using the fact that at least two lines must operate
between clusters 1 and 2 and between clusters 2 and 3:

$$P(L_1 \cap L_2 \mid N\ \text{IO WORK}) = P(\text{at least 2 of } N \text{ IO of cluster 1 WORK}) \times P(\text{at least 2 of } N \text{ IO of cluster 3 WORK}) \tag{3.3-11}$$

The probability that at least 2 of N input/output elements of a cluster
operate can be calculated using equation 3.1-14 and substituting N for X.
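The decomposition in equations 3.3-9 through 3.3-11 can be sketched in
Python as follows. Equations 3.1-5 and 3.1-14 are defined earlier in the
thesis; the binomial forms used for them here are assumptions, as are the
function names and the figure of five input/output elements per cluster.
The last function enumerates every state of the 15 elements in clusters 1,
2 and 3 as an independent cross-check of the decomposition.

```python
from itertools import product
from math import comb

IO_PER_CLUSTER = 5  # assumption: five I/O elements per cluster, one per line


def p_n_io_work(n, r_io, total=IO_PER_CLUSTER):
    """Assumed binomial form of eq. 3.1-5: exactly n of `total` elements work."""
    return comb(total, n) * r_io ** n * (1 - r_io) ** (total - n)


def p_at_least_2_work(n, r_io):
    """Assumed binomial form of eq. 3.1-14: at least 2 of n elements work."""
    return sum(p_n_io_work(k, r_io, total=n) for k in range(2, n + 1))


def p_link(r_io):
    """Eq. 3.3-8 for a five-line link."""
    line = r_io ** 2
    return 1 - ((1 - line) ** 5 + 5 * (1 - line) ** 4 * line)


def p_two_adjacent_links(r_io):
    """Eqs. 3.3-10 and 3.3-11: decompose on the shared cluster."""
    return sum(
        p_at_least_2_work(n, r_io) ** 2 * p_n_io_work(n, r_io)
        for n in range(IO_PER_CLUSTER + 1)
    )


def p_link_given_adjacent(r_io):
    """Eq. 3.3-9: P(L2 | L1)."""
    return p_two_adjacent_links(r_io) / p_link(r_io)


def p_two_adjacent_links_enumerated(r_io):
    """Cross-check by enumerating all states of clusters 1, 2 and 3."""
    total = 0.0
    for state in product((0, 1), repeat=3 * IO_PER_CLUSTER):
        c1, c2, c3 = state[0:5], state[5:10], state[10:15]
        link1 = sum(a & b for a, b in zip(c1, c2)) >= 2
        link2 = sum(b & c for b, c in zip(c2, c3)) >= 2
        if link1 and link2:
            working = sum(state)
            total += r_io ** working * (1 - r_io) ** (len(state) - working)
    return total
```

Under these assumptions the conditional probability P(L2 | L1) comes out
slightly higher than P(L1), reflecting the positive dependence between
adjacent links that share a cluster.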

The probability that link 3 operates, given that links 1 and 2
operate, is equivalent to the probability that link 3 operates given that
link 2 operates. The state of link 1 has no effect on link 3 because they
are independent. Only adjacent links are dependent. Therefore:

$$P(L_3 \mid L_1 \cap L_2) = P(L_3 \mid L_2) = P(L_2 \mid L_1) \tag{3.3-12}$$

Using the same line of reasoning:

$$P(L_2 \mid L_1) = P(L_3 \mid L_1 \cap L_2) = P(L_4 \mid L_1 \cap L_2 \cap L_3) = P(L_5 \mid L_1 \cap L_2 \cap L_3 \cap L_4) \tag{3.3-13}$$

These equalities all describe the probability that a link operates given
that one adjacent link operates. The probability that link 6 operates,
given that links 1 to 5 operate, describes the probability a link operates
given that both adjacent links operate. Once again, conditional
probability is employed:

$$P(L_6 \mid L_1 \cap L_2 \cap L_3 \cap L_4 \cap L_5) = P(L_6 \mid L_5 \cap L_1) = P(L_1 \cap L_5 \cap L_6) \,/\, P(L_1 \cap L_5) \tag{3.3-14}$$

Since links 1 and 5 are non-adjacent, hence independent, the probability
they both operate is the product of the reliabilities:

$$P(L_1 \cap L_5) = P(L_1) \times P(L_5) = P(L_1)^2 \tag{3.3-15}$$


$P(L_1)$ has been calculated in equation 3.3-8. The probability that links 1
and 5 and 6 operate can be calculated by decomposing on the input/output
elements of one cluster shared by two links, and then decomposing on the
input/output elements of the other cluster shared by two links. In this
case, the first decomposition is done on the elements of cluster 6, and
the second decomposition is done on the elements of cluster 1.

Decomposing on the elements of cluster 6 yields the equation:

$$P(L_1 \cap L_5 \cap L_6) = \sum_{M=0}^{5} \left[ P(L_1 \cap L_5 \cap L_6 \mid M\ \text{IO of C6 WORK}) \times P(M\ \text{IO of C6 WORK}) \right] \tag{3.3-16}$$

P(M IO WORK) has been calculated in equation 3.1-5. The probability that
links 1 and 5 and 6 operate, given that M IO elements of cluster 6
operate, can be calculated using the fact that at least two lines of each
of the three links must operate.

$$P(L_1 \cap L_5 \cap L_6 \mid (0\ \text{or}\ 1)\ \text{IO of C6 WORK}) = 0 \tag{3.3-17}$$

To calculate the remaining elements, a second decomposition is done on the
elements of cluster 1:

$$P(L_1 \cap L_5 \cap L_6 \mid 2\ \text{IO of C6 WORK}) = \sum_{N=0}^{5} \left[ P(L_1 \cap L_5 \cap L_6 \mid 2\ \text{IO of C6 and } N \text{ IO of C1 WORK}) \times P(N\ \text{IO of C1 WORK}) \right] \tag{3.3-18}$$

P(N IO WORK) has been calculated in equation 3.1-5.
$P(L_1 \cap L_5 \cap L_6 \mid X\ \text{IO of C6 WORK and } Y \text{ IO of C1 WORK})$ is calculated by
determining the product of the probabilities of three conditions:
1. The probability that a sufficient number of input/output elements of
cluster 5 operate to achieve a working link between clusters 5 and 6.
2. The probability that a sufficient number of input/output elements of
cluster 2 operate to achieve a working link between clusters 2 and 1.
3. The probability that the X operating elements in cluster 6, and Y
operating elements in cluster 1, will yield a working link between
clusters 6 and 1:

$$P(L_1 \cap L_5 \cap L_6 \mid 2\ \text{IO of C6 WORK and } N \text{ IO of C1 WORK}) = P(\text{at least 2 of 2 IO of C5 WORK}) \times P(\text{at least 2 of } N \text{ IO of C2 WORK}) \times P(\text{LINK is achieved between C6 and C1}) \tag{3.3-19}$$

P(at least 2 of N IO WORK) can be calculated using equation 3.1-14
and substituting N for X.

P(Link is achieved between C6 and C1) = 0, 0, .1, .3, .6, 1 for
N = 0, 1, 2, 3, 4, 5 respectively.

Using the same line of reasoning, the remaining elements can be
calculated:

$$P(L_1 \cap L_5 \cap L_6 \mid 3\ \text{IO of C6 WORK}) = \sum_{N=0}^{5} \left[ P(L_1 \cap L_5 \cap L_6 \mid 3\ \text{IO of C6 and } N \text{ IO of C1 WORK}) \times P(N\ \text{IO of C1 WORK}) \right] \tag{3.3-20}$$

where:

$$P(L_1 \cap L_5 \cap L_6 \mid 3\ \text{IO of C6 WORK and } N \text{ IO of C1 WORK}) = P(\text{at least 2 of 3 IO of C5 WORK}) \times P(\text{at least 2 of } N \text{ IO of C2 WORK}) \times P(\text{LINK is achieved between C6 and C1}) \tag{3.3-21}$$

P(Link is achieved between C6 and C1) = 0, 0, .3, .7, 1, 1 for
N = 0, 1, 2, 3, 4, 5 respectively.

$$P(L_1 \cap L_5 \cap L_6 \mid 4\ \text{IO of C6 WORK}) = \sum_{N=0}^{5} \left[ P(L_1 \cap L_5 \cap L_6 \mid 4\ \text{IO of C6 and } N \text{ IO of C1 WORK}) \times P(N\ \text{IO of C1 WORK}) \right] \tag{3.3-22}$$

where:

$$P(L_1 \cap L_5 \cap L_6 \mid 4\ \text{IO of C6 WORK and } N \text{ IO of C1 WORK}) = P(\text{at least 2 of 4 IO of C5 WORK}) \times P(\text{at least 2 of } N \text{ IO of C2 WORK}) \times P(\text{LINK is achieved between C6 and C1}) \tag{3.3-23}$$

P(Link is achieved between C6 and C1) = 0, 0, .6, 1, 1, 1 for
N = 0, 1, 2, 3, 4, 5 respectively.

$$P(L_1 \cap L_5 \cap L_6 \mid 5\ \text{IO of C6 WORK}) = \sum_{N=0}^{5} \left[ P(L_1 \cap L_5 \cap L_6 \mid 5\ \text{IO of C6 and } N \text{ IO of C1 WORK}) \times P(N\ \text{IO of C1 WORK}) \right] \tag{3.3-24}$$

where:

$$P(L_1 \cap L_5 \cap L_6 \mid 5\ \text{IO of C6 WORK and } N \text{ IO of C1 WORK}) = P(\text{at least 2 of 5 IO of C5 WORK}) \times P(\text{at least 2 of } N \text{ IO of C2 WORK}) \times P(\text{LINK is achieved between C6 and C1}) \tag{3.3-25}$$

P(Link is achieved between C6 and C1) = 0, 0, 1, 1, 1, 1 for
N = 0, 1, 2, 3, 4, 5 respectively.
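The tabulated values of P(Link is achieved between C6 and C1) are
consistent with a simple counting argument: if X elements operate in
cluster 6 and N in cluster 1, and every placement of the operating
elements among the five positions is equally likely, the link is achieved
when at least two positions operate at both ends. The Python sketch below
reproduces the four tables under that assumption; the function name and
the equal-likelihood assumption are illustrative and are not stated in
this form in the text.

```python
from math import comb

IO_PER_CLUSTER = 5  # assumption: five I/O positions per cluster


def p_link_achieved(x, n, total=IO_PER_CLUSTER, needed=2):
    """P(at least `needed` positions operate at both ends of the link).

    x elements operate at one end and n at the other; the n operating
    positions are taken to be a uniformly random subset of `total`.
    """
    favourable = sum(
        comb(x, k) * comb(total - x, n - k)  # k shared, n - k unshared
        for k in range(needed, min(x, n) + 1)
    )
    return favourable / comb(total, n)


def link_table(x):
    """Values for N = 0..5 with x elements operating in cluster 6."""
    return [p_link_achieved(x, n) for n in range(IO_PER_CLUSTER + 1)]
```

For example, `link_table(3)` corresponds to the values listed after
equation 3.3-21.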


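The probability that exactly one link fails can be calculated by
arbitrarily examining the case where only link 2 fails. As was done
previously, with an overbar denoting a failed link:

$$P(\bar{L}_2\ \text{ONLY}) = P(L_1) \times P(\bar{L}_2 \mid L_1) \times P(L_3 \mid L_1 \cap \bar{L}_2) \times P(L_4 \mid L_1 \cap \bar{L}_2 \cap L_3) \times P(L_5 \mid L_1 \cap \bar{L}_2 \cap L_3 \cap L_4) \times P(L_6 \mid L_1 \cap \bar{L}_2 \cap L_3 \cap L_4 \cap L_5) \tag{3.3-26}$$

Since any one link of six links may fail with equal probability, and these
events are mutually exclusive:

$$P(1\ \text{LINK FAILS}) = 6 \times P(\bar{L}_2\ \text{ONLY}) \tag{3.3-27}$$

Using the reasoning employed previously:

$$P(L_4 \mid L_1 \cap \bar{L}_2 \cap L_3) = P(L_4 \mid L_3) \tag{3.3-28}$$
$$P(L_5 \mid L_1 \cap \bar{L}_2 \cap L_3 \cap L_4) = P(L_5 \mid L_4) \tag{3.3-29}$$
$$P(L_6 \mid L_1 \cap \bar{L}_2 \cap L_3 \cap L_4 \cap L_5) = P(L_6 \mid L_1 \cap L_5) \tag{3.3-30}$$
$$P(L_2 \mid L_1) = P(L_4 \mid L_3) = P(L_5 \mid L_4) \tag{3.3-31}$$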


$P(L_1)$, $P(L_2 \mid L_1)$, and $P(L_6 \mid L_1 \cap L_5)$ have been calculated previously.
$P(\bar{L}_2 \mid L_1)$ is simply the complement of $P(L_2 \mid L_1)$:

$$P(\bar{L}_2 \mid L_1) = 1 - P(L_2 \mid L_1) \tag{3.3-32}$$

The only remaining element to calculate is $P(L_3 \mid L_1 \cap \bar{L}_2)$. Since links 1
and 3 are independent, the calculation becomes the probability that a link
operates given that an adjacent link fails. As was done previously,
conditional probability is employed:

$$P(L_3 \mid \bar{L}_2) = P(L_3 \cap \bar{L}_2) \,/\, P(\bar{L}_2) \tag{3.3-33}$$

$P(\bar{L}_2)$ is the complement of $P(L_1)$, which has been calculated previously.

$$P(\bar{L}_2) = 1 - P(L_1) \tag{3.3-34}$$

The probability link 3 operates and link 2 fails can be found by
decomposing on the input/output elements of the shared cluster
(cluster 3):

$$P(L_3 \cap \bar{L}_2) = \sum_{N=0}^{5} \left[ P(L_3 \cap \bar{L}_2 \mid N\ \text{IO WORK}) \times P(N\ \text{IO WORK}) \right] \tag{3.3-35}$$

where:

$$P(L_3 \cap \bar{L}_2 \mid N\ \text{IO WORK}) = P(\text{at least } N-1 \text{ of } N \text{ IO of C2 FAIL}) \times P(\text{at least 2 of } N \text{ IO of C4 WORK}) \tag{3.3-36}$$

P(N IO WORK) and P(at least 2 of N IO WORK) have already been calculated
in equations 3.1-5 and 3.1-14. P(at least N-1 of N IO FAIL) can be
calculated combinatorially:

$$P(\text{at least } N-1 \text{ of } N \text{ IO FAIL}) = \sum_{i=0}^{1} \left[ \binom{N}{i} (1 - RIO)^{N-i} (RIO)^{i} \right] \tag{3.3-37}$$

To generalize for n, the probabilities that 0 and 1 links fail become:

$$P(0\ \text{LINKS FAIL}) = P(L) \times P(L \mid L)^{n-2} \times P(L \mid L \cap L) \tag{3.3-38}$$

$$P(1\ \text{LINK FAILS}) = n \times P(L) \times P(\bar{L} \mid L) \times P(L \mid \bar{L}) \times P(L \mid L)^{n-4} \times P(L \mid L \cap L) \tag{3.3-39}$$
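The failed-link terms and the generalized expressions can be sketched in
Python as shown below. The binomial forms of equations 3.1-5 and 3.1-14
are assumptions (those equations are defined earlier in the thesis), as
are the function and parameter names. The function `p_isolation` takes
the conditional probabilities as inputs rather than recomputing them, so
it only illustrates how equations 3.3-6, 3.3-38 and 3.3-39 combine.

```python
from math import comb

IO_PER_CLUSTER = 5  # assumption: five I/O elements per cluster


def p_n_io_work(n, r_io, total=IO_PER_CLUSTER):
    """Assumed binomial form of eq. 3.1-5."""
    return comb(total, n) * r_io ** n * (1 - r_io) ** (total - n)


def p_at_least_2_work(n, r_io):
    """Assumed binomial form of eq. 3.1-14."""
    return sum(p_n_io_work(k, r_io, total=n) for k in range(2, n + 1))


def p_at_least_n_minus_1_fail(n, r_io):
    """Eq. 3.3-37: at most one of n elements works."""
    return sum(
        comb(n, i) * (1 - r_io) ** (n - i) * r_io ** i
        for i in range(min(n, 1) + 1)
    )


def p_link(r_io):
    """Eq. 3.3-8 for a five-line link."""
    line = r_io ** 2
    return 1 - ((1 - line) ** 5 + 5 * (1 - line) ** 4 * line)


def p_up_given_adjacent_down(r_io):
    """Eqs. 3.3-33 to 3.3-36: P(link operates | adjacent link fails)."""
    joint = sum(
        p_at_least_n_minus_1_fail(n, r_io)
        * p_at_least_2_work(n, r_io)
        * p_n_io_work(n, r_io)
        for n in range(IO_PER_CLUSTER + 1)
    )
    return joint / (1 - p_link(r_io))  # undefined when r_io == 1


def p_isolation(n, p_l, p_l_given_l, p_l_given_both, p_l_given_down):
    """Eqs. 3.3-6, 3.3-38 and 3.3-39 for a ring of n clusters (n >= 4)."""
    p_zero = p_l * p_l_given_l ** (n - 2) * p_l_given_both
    p_one = (
        n * p_l * (1 - p_l_given_l) * p_l_given_down
        * p_l_given_l ** (n - 4) * p_l_given_both
    )
    return 1 - (p_zero + p_one)
```

Note that "at least N-1 of N fail" is the complement of "at least 2 of N
work", which gives a convenient consistency check on equation 3.3-37.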

3.3.1.4 Topology Comparisons

This section attempts to quantify the performance of the three
topologies in the areas of throughput, maintainability, modularity, and
reliability. Comparisons will be made for the baseline case of six
clusters specifically and for the general case of n clusters. For the
purposes of this comparison, all clusters are assumed to communicate with
all other clusters with equal probability.

In the area of throughput, the fully linked system is clearly the
most desirable. Any two clusters can communicate directly (1 hop). The
centrally linked system requires an average of 1.67 and the singly linked
system an average of 1.80 hops to communicate between two clusters. As n
increases, the relative rankings remain unchanged, and the differences
become increasingly pronounced. For the fully linked system, k=1 always.
For the centrally linked system:

$$k_{avg} = \frac{1(n-1) + 2\left[\binom{n}{2} - (n-1)\right]}{\binom{n}{2}} \tag{3.3-40}$$

which simplifies to:

$$k_{avg} = 2 - \frac{2}{n-1} + \frac{2}{n(n-1)} \to 2 \quad (\text{as } n \to \infty) \tag{3.3-41}$$

For the singly linked system, $k_{avg}$ increases without bound as $n \to \infty$.
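The average hop counts quoted above can be reproduced with a short
breadth-first search over each topology. This Python sketch assumes that
all cluster pairs communicate with equal probability, as stated at the
start of this section; the graph constructors and function names are
illustrative.

```python
from collections import deque


def fully_linked(n):
    """Every cluster is linked to every other cluster."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def centrally_linked(n):
    """Cluster 0 is the central cluster; all others link only to it."""
    graph = {0: list(range(1, n))}
    graph.update({i: [0] for i in range(1, n)})
    return graph


def singly_linked(n):
    """Each cluster is linked to its two neighbours in a ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def hops_from(graph, source):
    """Breadth-first search: hop count from `source` to every cluster."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_hops(graph):
    """Mean hop count over all ordered pairs of distinct clusters."""
    n = len(graph)
    total = sum(sum(hops_from(graph, s).values()) for s in graph)
    return total / (n * (n - 1))


def centrally_linked_closed_form(n):
    """Eq. 3.3-41."""
    return 2 - 2 / (n - 1) + 2 / (n * (n - 1))
```

For six clusters, `average_hops` gives 1, about 1.67, and 1.8 for the
fully, centrally, and singly linked graphs respectively.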

In the areas of maintainability and modularity, the fully linked
system is clearly the least desirable. Replacing a cluster requires 25
disconnections and connections. Adding a cluster requires 25 connections.
The centrally linked system requires an average of 8.33 and the singly
linked an average of 10 disconnections and connections to replace a
cluster. Adding a cluster requires only 5 connections for the centrally
linked system while the singly linked system requires 10 disconnections
and 10 connections. As n increases, the relative rankings remain the
same: 5(n-1) disconnections and connections are required to replace a
cluster and 5(n-1) connections are required to add a cluster for the fully
linked system; an average of 5(2n-2)/n disconnections and reconnections
are required to replace a cluster and 5 connections are required to add a
cluster for the centrally linked system; and 10 disconnections and
reconnections are required to replace or add a cluster for the singly
linked system.
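The replacement counts follow directly from the number of lines attached
to each cluster. A small Python sketch, assuming five lines per link as in
the baseline and using illustrative function names:

```python
LINES_PER_LINK = 5  # assumption: five lines per link, as in the baseline


def replace_fully_linked(n):
    """Each cluster has a link to every one of the other n - 1 clusters."""
    return LINES_PER_LINK * (n - 1)


def replace_centrally_linked_average(n):
    """Average over one central cluster and n - 1 distributed clusters."""
    central = LINES_PER_LINK * (n - 1)  # central cluster links to all others
    distributed = LINES_PER_LINK        # each distributed cluster has one link
    return (central + (n - 1) * distributed) / n


def replace_singly_linked(n):
    """Each cluster has exactly two links, whatever the value of n."""
    return 2 * LINES_PER_LINK
```

For n = 6 these return 25, about 8.33, and 10, and the centrally linked
average equals 5(2n-2)/n for any n.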

In the area of reliability, the relationship between the complexity
of the input/output elements and their failure rates is a determining
factor in the relative reliabilities of the topologies. The reliability
models for the three topologies derived in the previous sections were
programmed in FORTRAN. Comparisons were made for the baseline case of 6
clusters. The system unreliabilities using both lower and upper bound
assumptions are depicted graphically in figures 3.3-2 and 3.3-3
respectively. Using lower bound assumptions, the unreliability curves for
the singly and fully linked systems are close, but the singly linked
system becomes noticeably more reliable for MTTFs greater than 60000
hours. Using upper bound assumptions, the unreliability curves of the
singly and fully linked systems are nearly identical but the raw data
show the singly linked system slightly more reliable. In both cases, the
centrally linked system is least reliable.

Figure 3.3-2: System Unreliability for 3 Topologies
(6 Clusters and Lower Bound Assumptions).

For architectures with more clusters, the singly linked system would
be expected to perform increasingly better relative to the other
topologies since the complexity of the input/output elements remains
constant. Figures 3.3-4 and 3.3-5 depict the unreliabilities of the three
topologies as the number of clusters increases using an MTTF of 60000
hours for lower and upper bound assumptions respectively. The singly
linked system is clearly superior in both cases. The fully and centrally
linked systems exhibit a crossover at n=16 and n=15 clusters in the lower
and upper bound graphs respectively. Before the crossover, the fully
linked is more reliable than the centrally linked; after the crossover,
the opposite is true. To examine the effects of a change in MTTF, figures
3.3-6 and 3.3-7 depict the unreliabilities of the three topologies using
an MTTF of 100000 hours for lower and upper bound assumptions
respectively. The graphs exhibit the same relative characteristics as the
MTTF=60000 hours case, but the crossover point increases to n=20 and n=19
clusters in the lower and upper bound graphs respectively. In all cases,
as n increases, the singly and centrally linked systems tend to parallel
each other while the singly and fully linked systems tend to diverge.

The fully linked baseline case modelled in section 3.1 was
undesirable in terms of the application reliability requirement.
Modifying the topology to a singly linked system does not affect the
reliability of the FTPP significantly enough to make a difference in terms
of this requirement.

Figure 3.3-4: System Unreliability for 3 Topologies and N Clusters
(MTTF=60000 hours and Lower Bound Assumptions).

Figure 3.3-5: System Unreliability for 3 Topologies and N Clusters
(MTTF=60000 hours and Upper Bound Assumptions).

Figure 3.3-6: System Unreliability for 3 Topologies and N Clusters
(MTTF=100000 hours and Lower Bound Assumptions).

Figure 3.3-7: System Unreliability for 3 Topologies and N Clusters
(MTTF=100000 hours and Upper Bound Assumptions).

3.3.2 Cluster Redundancy Strategies


Section 3.3.1 compared various cluster topologies under the
assumption that any cluster failure or isolation constituted system
failure. This section examines the case where some cluster
failures/isolations are tolerated. A cluster failure/isolation may be
acceptable in a system where there are redundant clusters or where
degradation of system throughput is acceptable.

3.3.2.1 Toleration of Cluster Failure/Isolation

The methods and equations of section 3.3.1 will be employed to derive
the reliability of a FTPP system of n clusters where 1 cluster failure or
isolation is tolerated. The unreliability of the system (Q(S)) now
becomes:

$$Q(S) = [P(0\ \text{ISOLATIONS}) \times P(2\ \text{OR MORE CLUSTERS FAIL})] + [P(1\ \text{ISOLATION}) \times (P(2\ \text{OR MORE CLUSTERS FAIL}) + ((n-1)/n)\,P(1\ \text{CLUSTER FAILS}))] + [P(2\ \text{OR MORE ISOLATIONS})] \tag{3.3-42}$$

The ((n-1)/n) factor is used to take into account the case where the
isolated cluster is also the failed cluster. Most of the terms of
equation 3.3-42 have already been derived.

$$P(0\ \text{ISOLATIONS}) = 1 - P(\text{ISOLATION}) \tag{3.3-43}$$

P(ISOLATION) is the probability of at least one cluster isolation and has
been derived for the various topologies in section 3.3.1:

$$P(2\ \text{OR MORE CLUSTERS FAIL}) = 1 - \left( RC^{n} + n\,RC^{n-1}(1 - RC) \right) \tag{3.3-44}$$

$$P(1\ \text{CLUSTER FAILS}) = n\,RC^{n-1}(1 - RC) \tag{3.3-45}$$

RC is the reliability of a cluster, neglecting the effects of the
input/output elements, and is defined in equation 3.3-2. The two remaining
terms are related:

$$P(2\ \text{OR MORE ISOLATIONS}) = P(\text{ISOLATION}) - P(1\ \text{ISOLATION}) \tag{3.3-46}$$

The remainder of this section is devoted to calculating the probability of
exactly one isolation P(1 ISOLATION) for the various topologies.
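Equations 3.3-42 through 3.3-46 combine into a single expression once
P(ISOLATION) and P(1 ISOLATION) are known for a topology. The Python
sketch below shows that combination; the function and argument names are
illustrative, and the isolation probabilities are supplied by the caller
rather than derived here.

```python
def system_unreliability_one_spare(n, rc, p_isolation, p_one_isolation):
    """Q(S) from eq. 3.3-42 for n clusters with one loss tolerated.

    n                -- total number of clusters
    rc               -- reliability of one cluster, without I/O effects
    p_isolation      -- P(at least one cluster isolation)
    p_one_isolation  -- P(exactly one cluster isolation)
    """
    p_zero_isolations = 1 - p_isolation                      # eq. 3.3-43
    p_one_fails = n * rc ** (n - 1) * (1 - rc)               # eq. 3.3-45
    p_two_or_more_fail = 1 - (rc ** n + p_one_fails)         # eq. 3.3-44
    p_two_or_more_isolations = p_isolation - p_one_isolation  # eq. 3.3-46
    return (
        p_zero_isolations * p_two_or_more_fail
        + p_one_isolation
        * (p_two_or_more_fail + (n - 1) / n * p_one_fails)
        + p_two_or_more_isolations
    )
```

With no isolations possible the expression reduces to the probability
that two or more clusters fail, and with perfect clusters it reduces to
the probability of two or more isolations.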

For the centrally linked system, to find P(1 ISOLATION), the
probability that one of the five distributed clusters becomes isolated is
calculated. Since this may occur in 5 different ways, the resultant
probability is multiplied by 5. Decomposing on the input/output elements
of the central cluster yields the equation:

$$P(1\ \text{ISOLATION}) = 5 \sum_{N=0}^{5} \left[ P(1\ \text{ISOLATION} \mid N\ \text{IO WORK}) \times P(N\ \text{IO WORK}) \right] \tag{3.3-47}$$

P(N IO WORK) is the probability that exactly N out of 5 input/output
elements of the central cluster operate, and can be calculated
combinatorially using equation 3.1-5. The probability of exactly one
isolation, given that N input/output elements of the central cluster
operate, is the probability that there are at least two operational lines
to four of the distributed clusters and less than two operational lines to
one of the distributed clusters.

$$P(1\ \text{ISOLATION} \mid N\ \text{IO WORK}) = P(\text{at least 2 of } N \text{ IO WORK})^{4} \times P(\text{less than 2 of } N \text{ IO WORK}) \tag{3.3-48}$$

P(at least 2 of N IO WORK) is found using equation 3.1-14 and substituting
RIOD and N for RIO and X respectively. P(less than 2 of N IO WORK) is
simply the complement of P(at least 2 of N IO WORK). To generalize for n,
the 5 preceding the summation in equation 3.3-47 becomes n-1 and the
exponent on equation 3.3-48 becomes n-2.
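A minimal Python sketch of equations 3.3-47 and 3.3-48, generalized to n
clusters as described above. The binomial forms of equations 3.1-5 and
3.1-14 are assumptions, since those equations are defined earlier in the
thesis, and the names `r_io`, `r_iod` and `n_clusters` are illustrative.

```python
from math import comb

IO_PER_CLUSTER = 5  # assumption: five I/O elements in the central cluster


def p_n_io_work(n, r, total=IO_PER_CLUSTER):
    """Assumed binomial form of eq. 3.1-5."""
    return comb(total, n) * r ** n * (1 - r) ** (total - n)


def p_at_least_2_work(n, r):
    """Assumed binomial form of eq. 3.1-14."""
    return sum(p_n_io_work(k, r, total=n) for k in range(2, n + 1))


def p_one_isolation_central(r_io, r_iod, n_clusters=6):
    """Eqs. 3.3-47 and 3.3-48 with n - 1 distributed clusters.

    r_io  -- reliability of a central-cluster I/O element
    r_iod -- reliability of a distributed-cluster I/O element (RIOD)
    """
    total = 0.0
    for n in range(IO_PER_CLUSTER + 1):
        linked = p_at_least_2_work(n, r_iod)  # one distributed cluster linked
        given_n = linked ** (n_clusters - 2) * (1 - linked)  # eq. 3.3-48
        total += given_n * p_n_io_work(n, r_io)
    return (n_clusters - 1) * total
```

If the distributed I/O elements are perfect the result is zero under this
model, because a central cluster with fewer than two working elements
isolates every distributed cluster at once rather than exactly one.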

For the fully linked system, clusters are assumed independent. The
probability that an arbitrary cluster is isolated, P(1 CLUSTER ISOLATION),
has already been defined by a lower and upper bound in section 3.1.1. The
probability that one of six clusters becomes isolated becomes:

$$P(1\ \text{ISOLATION}) = 6\,[1 - P(1\ \text{CLUSTER ISOLATION})]^{5} \times [P(1\ \text{CLUSTER ISOLATION})] \tag{3.3-49}$$

P(1 ISOLATION) will also be defined by a lower and upper bound depending
on the values of P(1 CLUSTER ISOLATION) used in equation 3.3-49. To
generalize for n, the 6 becomes an n and the exponent becomes n-1 in
equation 3.3-49.
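Equation 3.3-49 and its generalization amount to the binomial probability
of exactly one success in n independent trials. A brief Python sketch,
with illustrative names:

```python
def p_one_isolation_fully(p_cluster_isolation, n_clusters=6):
    """Eq. 3.3-49 generalized: exactly one of n independent clusters isolated."""
    p = p_cluster_isolation
    return n_clusters * (1 - p) ** (n_clusters - 1) * p
```

Evaluating the function once with the lower bound and once with the upper
bound on P(1 CLUSTER ISOLATION) gives the two corresponding values of
P(1 ISOLATION).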

For the singly linked system, allowing a cluster isolation means that
any two adjacent link failures are tolerated. To find P(1 ISOLATION), the
probability that two adjacent links fail is calculated. Since this may
occur in 6 different ways, the resultant probability is multiplied by 6.
Arbitrarily selecting links 2 and 3 as failed links yields:

$$P(1\ \text{ISOLATION}) = 6 \times P(L_1) \times P(\bar{L}_2 \mid L_1) \times P(\bar{L}_3 \mid L_1 \cap \bar{L}_2) \times P(L_4 \mid L_1 \cap \bar{L}_2 \cap \bar{L}_3) \times P(L_5 \mid L_1 \cap \bar{L}_2 \cap \bar{L}_3 \cap L_4) \times P(L_6 \mid L_1 \cap \bar{L}_2 \cap \bar{L}_3 \cap L_4 \cap L_5) \tag{3.3-50}$$

All terms in equation 3.3-50 have been calculated in section 3.3.1.3 with
the exception of $P(\bar{L}_3 \mid L_1 \cap \bar{L}_2)$, which is the probability that a link
fails, given that an adjacent link fails. Using conditional probability:

$$P(\bar{L}_3 \mid \bar{L}_2) = P(\bar{L}_3 \cap \bar{L}_2) \,/\, P(\bar{L}_2) \tag{3.3-51}$$

$P(\bar{L}_2)$ is the complement of $P(L_1)$, which has been calculated previously.
The probability link 3 fails and link 2 fails can be found by decomposing
on the input/output elements of the shared cluster (cluster 3).

$$P(\bar{L}_3 \cap \bar{L}_2) = \sum_{N=0}^{5} \left[ P(\bar{L}_3 \cap \bar{L}_2 \mid N\ \text{IO WORK}) \times P(N\ \text{IO WORK}) \right] \tag{3.3-52}$$

where:

$$P(\bar{L}_3 \cap \bar{L}_2 \mid N\ \text{IO WORK}) = P(\text{at least } N-1 \text{ of } N \text{ IO of C2 FAIL}) \times P(\text{at least } N-1 \text{ of } N \text{ IO of C4 FAIL}) \tag{3.3-53}$$

P(N IO WORK) and P(at least N-1 of N IO FAIL) have already been calculated
in equations 3.1-5 and 3.3-37. To generalize for n, equation 3.3-50
becomes:

$$P(1\ \text{ISOLATION}) = n \times P(L) \times P(\bar{L} \mid L) \times P(\bar{L} \mid \bar{L}) \times P(L \mid \bar{L}) \times P(L \mid L)^{n-5} \times P(L \mid L \cap L) \tag{3.3-54}$$
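The joint failure of two adjacent links in equations 3.3-51 through 3.3-53
can be sketched in Python as follows. The binomial form of equation 3.1-5
is an assumption, and the function and parameter names are illustrative.
The function `p_one_isolation_ring` takes the conditional probabilities
as inputs, so it only shows how the factors of equation 3.3-54 combine.

```python
from math import comb

IO_PER_CLUSTER = 5  # assumption: five I/O elements per cluster


def p_n_io_work(n, r_io, total=IO_PER_CLUSTER):
    """Assumed binomial form of eq. 3.1-5."""
    return comb(total, n) * r_io ** n * (1 - r_io) ** (total - n)


def p_at_least_n_minus_1_fail(n, r_io):
    """Eq. 3.3-37: at most one of n elements works."""
    return sum(
        comb(n, i) * (1 - r_io) ** (n - i) * r_io ** i
        for i in range(min(n, 1) + 1)
    )


def p_link(r_io):
    """Eq. 3.3-8 for a five-line link."""
    line = r_io ** 2
    return 1 - ((1 - line) ** 5 + 5 * (1 - line) ** 4 * line)


def p_both_adjacent_down(r_io):
    """Eqs. 3.3-52 and 3.3-53: two adjacent links both fail."""
    return sum(
        p_at_least_n_minus_1_fail(n, r_io) ** 2 * p_n_io_work(n, r_io)
        for n in range(IO_PER_CLUSTER + 1)
    )


def p_down_given_adjacent_down(r_io):
    """Eq. 3.3-51; undefined when r_io == 1."""
    return p_both_adjacent_down(r_io) / (1 - p_link(r_io))


def p_one_isolation_ring(n, p_l, p_l_given_l, p_l_given_both,
                         p_down_given_down, p_up_given_down):
    """Eq. 3.3-54 for a ring of n clusters (n >= 5)."""
    return (
        n * p_l * (1 - p_l_given_l) * p_down_given_down * p_up_given_down
        * p_l_given_l ** (n - 5) * p_l_given_both
    )
```

Because a link adjacent to a failed link either operates or fails,
`1 - p_down_given_adjacent_down(r_io)` should match the value of
$P(L_3 \mid \bar{L}_2)$ obtained from equation 3.3-33.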

The unreliabilities of the three topologies with a redundant cluster were
programmed in FORTRAN. Comparisons were made for the baseline case of 6
clusters plus 1 redundant cluster (the preceding derivation examined 5
clusters plus 1 redundant cluster). The system unreliabilities using both
lower and upper bound assumptions are depicted graphically in figures
3.3-8 and 3.3-9 respectively. Using lower bound assumptions, the fully
linked system is most reliable followed by the singly and centrally linked
systems respectively. Using upper bound assumptions, the singly and fully
linked systems are nearly identical while the centrally linked is least
reliable. The raw data show the singly linked system slightly more
reliable. While the assumptions of perfect and imperfect processor
coverage should not affect the relative reliabilities of the topologies,
the change can be accounted for by the fact that the probability of
isolation for the fully linked system was calculated with the additional
upper and lower bounds concerning input/output elements described in
section 3.3.1.

With a single redundant cluster, using lower bound assumptions and
component MTTFs under 10* hours (10* hours for network elements), the
fully linked system is able to meet the duplex computer reliability
requirement. The singly linked system is able to meet the triplex
computer requirement and the centrally linked system is able to meet the
simplex computer requirement. Using upper bound assumptions under the
same conditions, all three topologies come closer to, but do not meet,
the quadruplex computer requirement.

Figure 3.3-8: System Unreliability for 3 Topologies
(6 Clusters plus 1 Spare and Lower Bound Assumptions).

Figure 3.3-9: System Unreliability for 3 Topologies
(6 Clusters plus 1 Spare and Upper Bound Assumptions).

3.3.2.2 Replication of Clusters

The previous section (3.3.2.1) examined the case where a single
cluster failure/isolation was tolerated in a six cluster system. In terms
of system reliability calculations, this is identical to the case of a
five cluster system plus one redundant cluster. The methods employed in
the previous section may be used to determine the reliability of a system
with N redundant clusters. These methods produce exceedingly elaborate
equations as the number of redundant clusters increases. As an example:
for a six cluster system with two redundant clusters, the system's
unreliability (QS) becomes:

$$QS = [P(0\ \text{ISOLATIONS}) \times P(3\ \text{OR MORE CLUSTERS FAIL})] + [P(1\ \text{ISOLATION}) \times (P(3\ \text{OR MORE CLUSTERS FAIL}) + (21/28)\,P(2\ \text{CLUSTERS FAIL}))] + [P(2\ \text{ISOLATIONS}) \times (P(3\ \text{OR MORE CLUSTERS FAIL}) + (27/28)\,P(2\ \text{CLUSTERS FAIL}) + (6/8)\,P(1\ \text{CLUSTER FAILS}))] + [P(3\ \text{OR MORE ISOLATIONS})] \tag{3.3-55}$$

P(3 OR MORE ISOLATIONS) has not been calculated previously and will result
in elaborate, though manageable, equations.
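The weighting fractions in equation 3.3-55 can be recovered by counting.
With eight clusters in total and two losses tolerated, the system fails
when the isolated and failed clusters together number more than two. The
Python sketch below enumerates the equally likely sets of failed clusters
for a fixed set of isolated clusters; the function name and the
equal-likelihood assumption are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def loss_fraction(total, isolated, failed, tolerated):
    """Fraction of failure patterns that lose more than `tolerated` clusters.

    By symmetry the isolated clusters are fixed as 0..isolated-1, and every
    equally likely set of `failed` clusters out of `total` is enumerated.
    """
    isolated_set = set(range(isolated))
    patterns = list(combinations(range(total), failed))
    fatal = sum(1 for p in patterns if len(isolated_set | set(p)) > tolerated)
    return Fraction(fatal, len(patterns))
```

The same function reproduces the (n-1)/n factor of equation 3.3-42 when
called with one isolation, one failure, and one tolerated loss.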

A relatively simple case to examine is the fully connected baseline
system where the assumption of cluster independence permits the system
reliability (RS) to be expressed as a function of the number of redundant
clusters (N):


where:

RC = Reliability of a single cluster.

Lower and upper bounds can be calculated by using the lower and upper
bound calculations on cluster reliability respectively.
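For independent, identical clusters, a system that needs six working
clusters out of 6+N is commonly modelled with the k-out-of-n binomial
form. The Python sketch below assumes that form for RS; it is an
illustration under that assumption and is not a transcription of the
thesis equation. The names `rc`, `redundant` and `required` are
illustrative.

```python
from math import comb


def system_reliability(rc, redundant, required=6):
    """Assumed k-out-of-n form: at most `redundant` of the clusters fail.

    rc        -- reliability of a single cluster (RC)
    redundant -- number of redundant clusters (N)
    required  -- clusters needed to perform the assigned tasks
    """
    total = required + redundant
    return sum(
        comb(total, i) * rc ** (total - i) * (1 - rc) ** i
        for i in range(redundant + 1)
    )
```

In this sketch RC is supplied by the caller. In the fully connected
system RC itself changes with the total number of clusters, because each
added cluster adds input/output elements, so a fixed RC overstates the
benefit of adding clusters.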


The system unreliability of the fully connected baseline system plus
N redundant clusters, using both lower and upper bound assumptions, is
depicted graphically in figures 3.3-10 and 3.3-11. These figures show
that for the given values of N, the system reliability increases as more
clusters are added to the system, but at a decreasing rate. As more
clusters are added, the complexity and unreliability of the input/output
elements increase. This effect becomes increasingly dominant as N
increases and raises the question as to whether there is a limit to the
number of redundant clusters that can be added to a fully connected system
and still be able to increase system reliability.


Figures 3.3-12 and 3.3-13 depict the system unreliability as a
function of the number of redundant clusters, with six clusters required
to perform the assigned tasks, for the fully connected baseline system
using lower and upper bound assumptions respectively. The x-axis
scale depicts the total number of clusters in the system. Curves were
generated for MTTFs of 20000, 25000 and 30000 hours. The curves show
there is indeed a limit to the number of redundant clusters which can
increase system reliability. Using lower bound assumptions, the limit is
approximately 10 redundant clusters for MTTF=20000 hours, 13 clusters for
MTTF=25000 hours, and 16 clusters for MTTF=30000 hours. Using upper bound
assumptions, the limit is approximately 12 redundant clusters for
MTTF=20000 hours, 17 clusters for MTTF=25000 hours, and 20 clusters for
MTTF=30000 hours. The limit increases as the MTTFs of the system
components increase. The unreliability curves exhibit the effect of
diminishing marginal returns. Each additional redundant cluster produces
a smaller increase in reliability than the previous one (on a log scale).

Using lower bound assumptions and component MTTFs under 10* hours
(10* hours for network elements), the fully linked system is able to meet
the simplex computer reliability requirement with two redundant clusters.
Using upper bound assumptions under the same conditions, the fully linked
system is able to meet the triplex computer requirement with two redundant
clusters and the duplex computer requirement with four redundant clusters.

Figure 3.3-10: System Unreliability for Baseline System with
Redundant Clusters (Lower Bound Assumptions).

Figure 3.3-11: System Unreliability for Baseline System with
Redundant Clusters (Upper Bound Assumptions).


3.3.2.3 Distribution of Tasks Among Added Clusters

Section 3.2.3 examined the change in cluster reliability as a result
of assigning different numbers of tasks to the cluster. This section will
examine the system effect of decreasing the number of tasks per cluster
and using additional clusters in order to keep the system throughput
constant. As was discussed in section 3.2.3, using fewer tasks per
cluster tends to increase cluster reliability. Adding more clusters to
the system, however, tends to decrease system reliability. Four FTPP
configurations will be examined using the FTPP system model constructed
for the fully linked baseline system (section 3.1). The first
configuration is the baseline case of six clusters, three of which are
assigned four tasks and three of which are assigned five tasks. The
second configuration consists of seven clusters, each of which is assigned
four tasks. The third configuration consists of eleven clusters, ten of
which are assigned three tasks and one of which is assigned two tasks.
The fourth configuration consists of twenty-one clusters, each of which is
assigned two tasks. In each case, the four configurations perform 20
computational tasks and are, therefore, capable of the same throughput.
The difference between the four configurations is the way the tasks are
distributed throughout the system.

The system reliability was calculated for the assumptions of perfect
and imperfect processor coverage discussed in section 3.1.2. Results for
both lower and upper bound assumptions are depicted graphically in figures
3.3-14 and 3.3-13 respectively. Using lower bound assumptions, the six
cluster system is most reliable, followed by the seven cluster, eleven
cluster, and twenty-one cluster systems. Using upper bound assumptions,
the eleven cluster system is most reliable, followed by the seven cluster,
six cluster, and twenty-one cluster systems. In both cases, all curves
tend to diverge from one another with the exception of the twenty-one
cluster curve, which tends to converge with all other curves.

Using lower bound assumptions and component MTTFs under 10* hours
(10* hours for network elements), the distribution of tasks among added
clusters decreases system reliability to the point where not even the
quadruplex computer reliability requirement is met for the twenty-one
cluster case. Using upper bound assumptions under the same conditions,
only the eleven cluster system comes close to meeting the quadruplex
computer requirement.

Figure 3.3-14: System Unreliability for 20 Task N Cluster System
(Lower Bound Assumptions).

CHAPTER 4


CONCLUSIONS/RECOMMENDATIONS


4.1 CONCLUSIONS


This thesis has modelled an FTPP system architecture based upon a
potential space system application requirement. The architecture utilizes
four basic building block components: memory elements, processor elements,
network elements, and input/output elements. The FTPP architectural and
redundancy strategies were perturbed in attempts to define a more reliable
system. These perturbations produced relationships between FTPP
reliability, throughput, and architecture. A description of these
relationships and the modelling techniques used to derive them should
prove useful to a system designer attempting to meet a specific
application requirement.

In good design practice, the choice of reliable components is
essential [6]. This is particularly true for the FTPP processor and
network elements, which provide the greatest delta increase in cluster
reliability of the four FTPP building blocks. Processor elements provide
a greater delta increase when upper bound processor assumptions are used,
and upper bound assumptions will more likely predict actual processor
behavior than will lower bound assumptions. Network elements, in turn,
are the simplest components used in the FTPP and will probably be the
most difficult element to improve upon in terms of failure rate. The use
of the most reliable processor elements should therefore be particularly
stressed in the design phase. The possibility of improving network
element reliability through duplication is being considered.

In defining a cluster architecture, the number of processor elements
per network and memory must be chosen. There exists an optimum number of
processors that will maximize cluster reliability, and as component
failure rate decreases, the optimum number of processors will also
decrease. For the range of parameters examined in this thesis, the
optimal number of processors was four for component MTTFs below 20000
hours (200000 hours for network elements) and three for MTTFs above 20000
hours. Before defining an architecture, the designer should perform an
optimization for the component parameters used. The optimization will
depend on the component failure rates, processor coverage, and the degree
to which added component complexity affects component failure rate.

Both cluster controllers and global controllers represent reliability
bottlenecks that deserve special attention. A cluster cannot tolerate a
cluster controller failure, and the system cannot tolerate a global
controller failure. In a system with realistic computational speedup
assumptions, the loss of a set of processors assigned a computational task
represents a relatively small decrease in system throughput which the
system should tolerate (the number of computational tasks is equal to the
number of parallel computational paths the FTPP job is partitioned into).


Therefore,  processors  performing  computational  tasks  should  be  able  to 
perform  a  controller  task  should  a  controller  experience  degradation. 
Also,  to  prevent  single  point  failures,  tasks  should  be  dispersed  to  the 
fullest  extent  possible  among  processor  elements  attached  to  different 
memory  and  network  elements.  Supplementing  controllers  with  additional 
processors  and  dispersing  tasks  results  in  a  more  graceful  degradation  of 
the  FTPP. 

The  choice  of  cluster  topology  not  only  affects  system  reliability  but 
also  system  throughput,  maintainability,  and  modularity.  Table  4.1-1 
depicts  the  relative  rankings  of  the  three  topologies  as  examined  for  the 
baseline  case  of  six  clusters.  A  ranking  of  one  represents  the  most 
desirable  system,  and  a  ranking  of  three  represents  the  least  desirable. 
It  is  left  to  the  system  designer  to  determine  the  relative  weight  of  each 
attribute.  In  the  baseline  case,  the  singly  linked  system  is  the  most 
reliable  followed  closely  by  the  fully  linked  system,  and  then  by  the 
centrally  linked  system.  As  the  number  of  clusters  in  the  system 
increases,  the  singly  linked  system  stays  most  reliable,  but  with  an 
increasing  margin  over  the  second  best,  while  the  fully  and  centrally 
linked  systems  switch  relative  positions.  The  number  of  clusters  at
which  the  centrally  linked  system  becomes  more  reliable  than  the  fully
linked  system  increases  as  the  component  failure  rates  decrease.  The
addition  of  a  redundant  cluster  to  the  baseline  case  favors  the  fully 
linked  system  over  the  singly  linked  system. 

                   Centrally    Fully     Singly
                   Linked       Linked    Linked

Throughput             2           1         3
Maintainability        1           3         2
Modularity             1           3         2
Reliability            3           2         1

Table 4.1-1: Relative Rankings of Topologies for Six Cluster System.
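
One way a designer might apply relative weights to the rankings of Table 4.1-1 is sketched below. The rankings are taken from the table; the weights, the function names, and the use of a weighted sum of ranks are illustrative assumptions, not a method proposed in this thesis. Because the rankings are ordinal, such a score is only a rough screening device.

```python
# Rankings from Table 4.1-1 (1 = most desirable, 3 = least desirable).
RANKINGS = {
    "centrally linked": {"throughput": 2, "maintainability": 1,
                         "modularity": 1, "reliability": 3},
    "fully linked": {"throughput": 1, "maintainability": 3,
                     "modularity": 3, "reliability": 2},
    "singly linked": {"throughput": 3, "maintainability": 2,
                      "modularity": 2, "reliability": 1},
}


def weighted_score(ranks, weights):
    """Weighted sum of ranks for one topology; a lower score is better."""
    return sum(weights[attribute] * rank for attribute, rank in ranks.items())


def order_topologies(weights):
    """Return (topology, score) pairs sorted from best to worst."""
    scores = {name: weighted_score(ranks, weights)
              for name, ranks in RANKINGS.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# HYPOTHETICAL weight sets chosen only to illustrate the calculation.
EQUAL_WEIGHTS = {"throughput": 1, "maintainability": 1,
                 "modularity": 1, "reliability": 1}
RELIABILITY_HEAVY = {"throughput": 1, "maintainability": 1,
                     "modularity": 1, "reliability": 4}

print(order_topologies(EQUAL_WEIGHTS))
print(order_topologies(RELIABILITY_HEAVY))
```

With equal weights the centrally linked system scores best; weighting reliability four times as heavily as the other attributes moves the singly linked system to the top.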

Different reliability analysis techniques are more suitable for
specific topologies. The assumption of cluster independence can be made
for the fully linked system with little loss in the accuracy of the
reliability calculations. Using the assumption of cluster independence,
the reliability of a fully linked system can be easily estimated for any
arbitrary system of n clusters with x redundant clusters. The reliability
of a centrally linked system can be easily calculated for a system of n
clusters by decomposing on the input/output elements of the central
cluster to find the reliability of the system of input/output elements.
This reliability, together with an estimate of the cluster's reliability
(without input/output elements), is used to estimate the system's
reliability. The singly linked system is the most difficult to analyze,
but the analysis can be accomplished using conditional probability, since
only adjacent cluster links are dependent. As with the centrally linked
system, the reliability of the system of input/output elements together
with an estimate of the cluster's reliability can be used to estimate the
system's reliability. The same methods can be used to calculate the
reliability of a centrally or singly linked system with redundant
clusters. These calculations, however, are not suited to a system with
more than a few redundant clusters due to the increased complexity of the
calculations.
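
The estimate for a fully linked system under the cluster independence assumption can be illustrated with a standard k-out-of-n binomial sum. This is a minimal sketch and not the thesis model: it assumes identical clusters with a common reliability at the mission time, reads "n clusters with x redundant clusters" as n required clusters plus x spares, assumes perfect reconfiguration, and ignores link failures. The function name and arguments are hypothetical.

```python
from math import comb


def fully_linked_reliability(r_cluster, n_required, x_redundant):
    """Probability that at least n_required of n_required + x_redundant
    independent, identical clusters survive.

    ASSUMPTIONS (illustrative only): independent clusters, a common
    cluster reliability r_cluster, perfect reconfiguration onto spares,
    and no contribution from inter-cluster link failures.
    """
    total = n_required + x_redundant
    q_cluster = 1.0 - r_cluster
    # The system survives if the number of failed clusters is at most x_redundant.
    return sum(
        comb(total, failed) * q_cluster ** failed * r_cluster ** (total - failed)
        for failed in range(x_redundant + 1)
    )


# Example: six required clusters with zero, one, and two redundant clusters.
for spares in range(3):
    print(spares, fully_linked_reliability(0.99, 6, spares))
```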

Additional  clusters  in  an  FTPP  system  may  be  used  in  two  ways:  to
decrease  the  task  load  per  cluster,  or  to  serve  as  spares.  Using
additional  clusters  to  decrease  the  tasks  per  cluster  is  a  passive  method
which  requires  only  an  initial  assignment  of  tasks,  while  using  additional 
clusters  as  spares  is  an  active  method  which  requires  the  migration  of  the 
tasks  of  one  cluster  to  another.  From  the  viewpoint  of  hardware, 
reliability  modelling  shows  redundant  clusters  to  be  clearly  more 
effective  in  increasing  system  reliability.  On  the  other  hand,  from  the 
viewpoint  of  software,  redundant  clusters  require  more  complex  programs 
thus  decreasing  software  reliability.  There  is  a  limit  to  the  number  of 
redundant  clusters  that  can  be  added  to  a  system  and  still  increase  system 
reliability  in  a  fully  linked  system,  but  this  limit  is  high  enough  that 
it  should  not  be  a  concern  in  a  practical  system  design. 


The application reliability and throughput requirements are extremely
stringent. The application reliability requirement cannot be met without
using redundant clusters or components whose failure rates are
unattainable in the foreseeable future. Using lower bound assumptions,
the reliability requirement for a simplex computer can be met with the
addition of two redundant clusters to the baseline system using components
with MTTFs under 10* hours (10* hours for network elements). Using upper
bound assumptions, the reliability requirement for a triplex computer
system can be met with the same redundancy level and component MTTFs. The
application throughput requirement can be met relatively easily by using
twenty sets of 5 MIPS processors if ideal speedup assumptions are made.
Using the speedup bounds of Hwang and Briggs would require between ninety
and a million plus sets of processors. The efficiency in the partitioning
of software for use in a parallel system is clearly important and may have
a profound effect on the system architecture and performance.
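
The processor set counts quoted above can be reproduced with a short calculation. The sketch assumes that the Hwang and Briggs bounds referred to are log2(n) <= S <= n / ln(n) for n processor sets, which is an assumption made here because it reproduces the "ninety" and "million plus" figures; the function names are illustrative.

```python
import math

# Twenty sets of 5 MIPS processors meet the requirement under ideal
# speedup, so the required speedup over a single set is 20.
REQUIRED_SPEEDUP = 20.0


def sets_ideal(speedup):
    """Ideal (linear) speedup: S = n."""
    return math.ceil(speedup)


def sets_optimistic_bound(speedup):
    """Smallest n with n / ln(n) >= speedup (ASSUMED upper speedup bound).

    The search starts at n = 3 because n / ln(n) increases for n >= e.
    """
    n = 3
    while n / math.log(n) < speedup:
        n += 1
    return n


def sets_pessimistic_bound(speedup):
    """Smallest n with log2(n) >= speedup (ASSUMED lower speedup bound)."""
    return math.ceil(2 ** speedup)


print(sets_ideal(REQUIRED_SPEEDUP))              # 20
print(sets_optimistic_bound(REQUIRED_SPEEDUP))   # 90
print(sets_pessimistic_bound(REQUIRED_SPEEDUP))  # 1048576
```

The gap between 90 and 1048576 sets shows how strongly the achievable speedup, and therefore the software partitioning, drives the size of the architecture.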


4.2  RECOMMENDATIONS  FOR  FURTHER  RESEARCH 


1.  The  combinatorial  models  used  in  this  thesis  are  unable  to  model 
simultaneous  failures.  The  effect  of  a  component  failing  during  a 
reconfiguration  can  be  represented  using  Markov  Modelling.  The  number  of 
states  required  to  model  the  FTPP,  however,  would  be  prohibitively  large 
and  simplifications  would  have  to  be  made. 

2.  A  simpler  way  is  needed  to  calculate  or  estimate  the  reliability  of 
non-fully  connected  topologies  of  systems  with  redundant  clusters. 

3.  This  thesis  has  modelled  three  basic  FTPP  topologies;  modelling 
'hybrid'  topologies,  which  combine  the  advantages  and  disadvantages  of 
these  basic  topologies,  would  provide  the  FTPP  designer  with  further 
options. 

4.  This  thesis  has  neglected  the  reliability  effects  of  software. 
Techniques  for  computing  software  reliability  are  less  advanced  than  those 
of  computer  hardware  reliability,  yet  both  are  equally  important  in  the
calculation  of  a  system's  reliability. 